[ceph-users] Musings

2014-08-14 Thread Robert LeBlanc
We are looking to deploy Ceph in our environment and I have some musings
that I would like some feedback on. There are concerns about scaling a
single Ceph instance to the PBs of capacity we would use, so the idea is to
start small, with one Ceph cluster per rack or two, and then, as we feel more
comfortable with it, expand/combine clusters into larger systems. I'm
not sure that it is possible to combine discrete Ceph clusters. It also
seems to make sense to build a CRUSH map that defines regions, data
centers, sections, rows, racks, and hosts now so that there is less data
migration later, but I'm not sure how a merge would work.
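
A minimal sketch of laying out that hierarchy up front, using the stock CRUSH
bucket types (datacenter, row, rack, host); the bucket and host names here are
hypothetical:

    # Create the hierarchy now, even if most levels start with a single bucket
    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket row1 row
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move dc1 root=default
    ceph osd crush move row1 datacenter=dc1
    ceph osd crush move rack1 row=row1
    # Place each host under its rack so later growth only adds siblings
    ceph osd crush move ceph-node1 rack=rack1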

I've been also toying with the idea of SSD journal per node verses SSD
cache tier pool verses lots of RAM for cache. Based on the performance
webinar today, it seems that cache misses in the cache pool cause a lot of
writing to the cache pool and severely degrade performance. I certainly
like the idea of a heat map so that a single read of an entire VM (backup,
rsync) won't kill the cache pool.
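
For reference, a rough sketch of the firefly-era cache tier setup being weighed
here; the pool names are hypothetical and the values are placeholders:

    # Attach a fast pool as a writeback cache in front of the backing pool
    ceph osd tier add rbd-pool cache-pool
    ceph osd tier cache-mode cache-pool writeback
    ceph osd tier set-overlay rbd-pool cache-pool
    # The hit set is what gives the tiering agent its notion of "heat"
    ceph osd pool set cache-pool hit_set_type bloom
    ceph osd pool set cache-pool hit_set_count 1
    ceph osd pool set cache-pool hit_set_period 3600
    ceph osd pool set cache-pool target_max_bytes 1000000000000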

I've also been bouncing around the idea of getting data locality by configuring the
CRUSH map to keep two of the three replicas within the same row and the
third replica just somewhere in the data center. Based on a conversation on
IRC a couple of days ago, it seems that this could work very well if
min_size is 2. But the documentation and the objective of Ceph seems to
indicate that min_size only applies in degraded situations. During normal
operation a write would have to be acknowledged by all three replicas
before being returned to the client, otherwise it would be eventually
consistent and not strongly consistent (I do like the idea of eventually
consistent for replication as long as we can be strongly consistent in some
form at the same time like 2 out of 3).
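
For what it's worth, the replica counts in question are just pool properties;
a quick sketch (the pool name is hypothetical):

    # Three copies total, but I/O continues while at least two are available
    ceph osd pool set rbd-pool size 3
    ceph osd pool set rbd-pool min_size 2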

I've read through the online manual, so now I'm looking for personal
perspectives that you may have.

Thanks,
Robert LeBlanc


Re: [ceph-users] v0.84 released

2014-08-18 Thread Robert LeBlanc
This may be a better question for Federico. I've pulled the systemd stuff
from git and I have it working, but only if I have the volumes listed in
fstab. Is this the intended way that systemd will function for now or am I
missing a step? I'm pretty new to systemd.

Thanks,
Robert LeBlanc


On Mon, Aug 18, 2014 at 1:14 PM, Sage Weil  wrote:

> The next Ceph development release is here!  This release contains several
> meaty items, including some MDS improvements for journaling, the ability
> to remove the CephFS file system (and name it), several mon cleanups with
> tiered pools, several OSD performance branches, a new "read forward" RADOS
> caching mode, a prototype Kinetic OSD backend, and various radosgw
> improvements (especially with the new standalone civetweb frontend).  And
> there are a zillion OSD bug fixes. Things are looking pretty good for the
> Giant release that is coming up in the next month.
>
> Upgrading
> -
>
> * The *_kb perf counters on the monitor have been removed.  These are
>   replaced with a new set of *_bytes counters (e.g., cluster_osd_kb is
>   replaced by cluster_osd_bytes).
>
> * The rd_kb and wr_kb fields in the JSON dumps for pool stats (accessed via
>   the 'ceph df detail -f json-pretty' and related commands) have been
> replaced
>   with corresponding *_bytes fields.  Similarly, the 'total_space',
> 'total_used',
>   and 'total_avail' fields are replaced with 'total_bytes',
>   'total_used_bytes', and 'total_avail_bytes' fields.
>
> * The 'rados df --format=json' output 'read_bytes' and 'write_bytes'
>   fields were incorrectly reporting ops; this is now fixed.
>
> * The 'rados df --format=json' output previously included 'read_kb' and
>   'write_kb' fields; these have been removed.  Please use 'read_bytes' and
>   'write_bytes' instead (and divide by 1024 if appropriate).
>
> Notable Changes
> ---
>
> * ceph-conf: flush log on exit (Sage Weil)
> * ceph-dencoder: refactor build a bit to limit dependencies (Sage Weil,
>   Dan Mick)
> * ceph.spec: split out ceph-common package, other fixes (Sandon Van Ness)
> * ceph_test_librbd_fsx: fix RNG, make deterministic (Ilya Dryomov)
> * cephtool: refactor and improve CLI tests (Joao Eduardo Luis)
> * client: improved MDS session dumps (John Spray)
> * common: fix dup log messages (#9080, Sage Weil)
> * crush: include new tunables in dump (Sage Weil)
> * crush: only require rule features if the rule is used (#8963, Sage Weil)
> * crushtool: send output to stdout, not stderr (Wido den Hollander)
> * fix i386 builds (Sage Weil)
> * fix struct vs class inconsistencies (Thorsten Behrens)
> * hadoop: update hadoop tests for Hadoop 2.0 (Haumin Chen)
> * librbd, ceph-fuse: reduce cache flush overhead (Haomai Wang)
> * librbd: fix error path when opening image (#8912, Josh Durgin)
> * mds: add file system name, enabled flag (John Spray)
> * mds: boot refactor, cleanup (John Spray)
> * mds: fix journal conversion with standby-replay (John Spray)
> * mds: separate inode recovery queue (John Spray)
> * mds: session ls, evict commands (John Spray)
> * mds: submit log events in async thread (Yan, Zheng)
> * mds: use client-provided timestamp for user-visible file metadata (Yan,
>   Zheng)
> * mds: validate journal header on load and save (John Spray)
> * misc build fixes for OS X (John Spray)
> * misc integer size cleanups (Kevin Cox)
> * mon: add get-quota commands (Joao Eduardo Luis)
> * mon: do not create file system by default (John Spray)
> * mon: fix 'ceph df' output for available space (Xiaoxi Chen)
> * mon: fix bug when no auth keys are present (#8851, Joao Eduardo Luis)
> * mon: fix compat version for MForward (Joao Eduardo Luis)
> * mon: restrict some pool properties to tiered pools (Joao Eduardo Luis)
> * msgr: misc locking fixes for fast dispatch (#8891, Sage Weil)
> * osd: add 'dump_reservations' admin socket command (Sage Weil)
> * osd: add READFORWARD caching mode (Luis Pabon)
> * osd: add header cache for KeyValueStore (Haomai Wang)
> * osd: add prototype KineticStore based on Seagate Kinetic (Josh Durgin)
> * osd: allow map cache size to be adjusted at runtime (Sage Weil)
> * osd: avoid refcounting overhead by passing a few things by ref (Somnath
>   Roy)
> * osd: avoid sharing PG info that is not durable (Samuel Just)
> * osd: clear slow request latency info on osd up/down (Sage Weil)
> * osd: fix PG object listing/ordering bug (Guang Yang)
> * osd: fix PG stat errors with tiering (#9082, Sage Weil)
> * osd: fix bug with long object names and rename (#8701, Sage Weil)
> * osd: fix 

Re: [ceph-users] v0.84 released

2014-08-19 Thread Robert LeBlanc
OK, I don't think the udev rules are on my machines. I built the cluster
manually and not with ceph-deploy. I must have missed adding the rules when
following the manual, or the packages from Debian (Jessie) did not create them.

Robert LeBlanc


On Mon, Aug 18, 2014 at 5:49 PM, Sage Weil  wrote:

> On Mon, 18 Aug 2014, Robert LeBlanc wrote:
> > This may be a better question for Federico. I've pulled the systemd stuff
> > from git and I have it working, but only if I have the volumes listed in
> > fstab. Is this the intended way that systemd will function for now or am
> I
> > missing a step? I'm pretty new to systemd.
>
> The OSDs are normally mounted and started via udev, which will call
> 'ceph-disk activate '.  The missing piece is teaching ceph-disk
> how to start up the systemd service for the OSD.  I suspect that this can
> be completely dynamic, based on udev events, not not using 'enable' thing
> where systemd persistently registers that a service is to be started...?
>
> sage
>
>
>
>
> > Thanks,
> > Robert LeBlanc
> >
> >
> > On Mon, Aug 18, 2014 at 1:14 PM, Sage Weil  wrote:
> >   The next Ceph development release is here!  This release
> >   contains several
> >   meaty items, including some MDS improvements for journaling, the
> >   ability
> >   to remove the CephFS file system (and name it), several mon
> >   cleanups with
> >   tiered pools, several OSD performance branches, a new "read
> >   forward" RADOS
> >   caching mode, a prototype Kinetic OSD backend, and various
> >   radosgw
> >   improvements (especially with the new standalone civetweb
> >   frontend).  And
> >   there are a zillion OSD bug fixes. Things are looking pretty
> >   good for the
> >   Giant release that is coming up in the next month.
> >
> >   Upgrading
> >   -
> >
> >   * The *_kb perf counters on the monitor have been removed.
> >   These are
> > replaced with a new set of *_bytes counters (e.g.,
> >   cluster_osd_kb is
> > replaced by cluster_osd_bytes).
> >
> >   * The rd_kb and wr_kb fields in the JSON dumps for pool stats
> >   (accessed via
> > the 'ceph df detail -f json-pretty' and related commands) have
> >   been replaced
> > with corresponding *_bytes fields.  Similarly, the
> >   'total_space', 'total_used',
> > and 'total_avail' fields are replaced with 'total_bytes',
> > 'total_used_bytes', and 'total_avail_bytes' fields.
> >
> >   * The 'rados df --format=json' output 'read_bytes' and
> >   'write_bytes'
> > fields were incorrectly reporting ops; this is now fixed.
> >
> >   * The 'rados df --format=json' output previously included
> >   'read_kb' and
> > 'write_kb' fields; these have been removed.  Please use
> >   'read_bytes' and
> > 'write_bytes' instead (and divide by 1024 if appropriate).
> >
> >   Notable Changes
> >   ---
> >
> >   * ceph-conf: flush log on exit (Sage Weil)
> >   * ceph-dencoder: refactor build a bit to limit dependencies
> >   (Sage Weil,
> > Dan Mick)
> >   * ceph.spec: split out ceph-common package, other fixes (Sandon
> >   Van Ness)
> >   * ceph_test_librbd_fsx: fix RNG, make deterministic (Ilya
> >   Dryomov)
> >   * cephtool: refactor and improve CLI tests (Joao Eduardo Luis)
> >   * client: improved MDS session dumps (John Spray)
> >   * common: fix dup log messages (#9080, Sage Weil)
> >   * crush: include new tunables in dump (Sage Weil)
> >   * crush: only require rule features if the rule is used (#8963,
> >   Sage Weil)
> >   * crushtool: send output to stdout, not stderr (Wido den
> >   Hollander)
> >   * fix i386 builds (Sage Weil)
> >   * fix struct vs class inconsistencies (Thorsten Behrens)
> >   * hadoop: update hadoop tests for Hadoop 2.0 (Haumin Chen)
> >   * librbd, ceph-fuse: reduce cache flush overhead (Haomai Wang)
> >   * librbd: fix error path when opening image (#8912, Josh Durgin)
> >   * mds: add file system name, enabled flag (John Spray)
> >   * mds: boot refactor, cleanup (John Spray)
> >   * mds: fix journal conversion with standby-replay

Re: [ceph-users] v0.84 released

2014-08-19 Thread Robert LeBlanc
Thanks Sage, I was looking in /etc/udev/rules.d (duh!). If I'm reading the
rules right, my problem has to do with putting Ceph on the entire block
device and not setting up a partition (bad habit from LVM). This will give
me some practice with failing and rebuilding OSDs. If I understand right, a
udev trigger should mount and activate the OSD, and I won't have to
manually run the init.d script?
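
A minimal sketch of exercising that udev-driven path by hand (device names are
hypothetical); ceph-disk prepare writes the GPT partition type GUIDs that the
udev rules match on:

    # Prepare the whole disk; this partitions it and tags the partitions for udev
    ceph-disk prepare /dev/sdb
    # Replay the add events, which should mount and activate the OSD
    udevadm trigger --subsystem-match=block --action=add
    # Or activate a prepared data partition directly
    ceph-disk activate /dev/sdb1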

Thanks,
Robert LeBlanc


On Tue, Aug 19, 2014 at 9:21 AM, Sage Weil  wrote:

> On Tue, 19 Aug 2014, Robert LeBlanc wrote:
> > OK, I don't think the udev rules are on my machines. I built the cluster
> > manually and not with ceph-deploy. I must have missed adding the rules in
> > the manual or the Packages from Debian (Jessie) did not create them.
>
> They are normally part of the ceph package:
>
> $ dpkg -L ceph | grep udev
> /lib/udev
> /lib/udev/rules.d
> /lib/udev/rules.d/60-ceph-partuuid-workaround.rules
> /lib/udev/rules.d/95-ceph-osd.rules
>
> sage
>
>
> > Robert LeBlanc
>
> >
> >
> > On Mon, Aug 18, 2014 at 5:49 PM, Sage Weil  wrote:
> >   On Mon, 18 Aug 2014, Robert LeBlanc wrote:
> >   > This may be a better question for Federico. I've pulled the
> >   systemd stuff
> >   > from git and I have it working, but only if I have the volumes
> >   listed in
> >   > fstab. Is this the intended way that systemd will function for
> >   now or am I
> >   > missing a step? I'm pretty new to systemd.
> >
> > The OSDs are normally mounted and started via udev, which will call
> > 'ceph-disk activate '.  The missing piece is teaching
> > ceph-disk
> > how to start up the systemd service for the OSD.  I suspect that this
> > can
> > be completely dynamic, based on udev events, not not using 'enable'
> > thing
> > where systemd persistently registers that a service is to be
> > started...?
> >
> > sage
> >
> >
> >
> >
> > > Thanks,
> > > Robert LeBlanc
> > >
> > >
> > > On Mon, Aug 18, 2014 at 1:14 PM, Sage Weil  wrote:
> > >   The next Ceph development release is here!  This release
> > >   contains several
> > >   meaty items, including some MDS improvements for journaling,
> > the
> > >   ability
> > >   to remove the CephFS file system (and name it), several mon
> > >   cleanups with
> > >   tiered pools, several OSD performance branches, a new "read
> > >   forward" RADOS
> > >   caching mode, a prototype Kinetic OSD backend, and various
> > >   radosgw
> > >   improvements (especially with the new standalone civetweb
> > >   frontend).  And
> > >   there are a zillion OSD bug fixes. Things are looking pretty
> > >   good for the
> > >   Giant release that is coming up in the next month.
> > >
> > >   Upgrading
> > >   -
> > >
> > >   * The *_kb perf counters on the monitor have been removed.
> > >   These are
> > > replaced with a new set of *_bytes counters (e.g.,
> > >   cluster_osd_kb is
> > > replaced by cluster_osd_bytes).
> > >
> > >   * The rd_kb and wr_kb fields in the JSON dumps for pool stats
> > >   (accessed via
> > > the 'ceph df detail -f json-pretty' and related commands)
> > have
> > >   been replaced
> > > with corresponding *_bytes fields.  Similarly, the
> > >   'total_space', 'total_used',
> > > and 'total_avail' fields are replaced with 'total_bytes',
> > > 'total_used_bytes', and 'total_avail_bytes' fields.
> > >
> > >   * The 'rados df --format=json' output 'read_bytes' and
> > >   'write_bytes'
> > > fields were incorrectly reporting ops; this is now fixed.
> > >
> > >   * The 'rados df --format=json' output previously included
> > >   'read_kb' and
> > > 'write_kb' fields; these have been removed.  Please use
> > >   'read_bytes' and
> > > 'write_bytes' instead (and divide by 1024 if appropriate).
> > >
> > >   Notable Changes
> > >   ---
> > >
> > >   * ceph-conf: flush log on exit (Sage Weil)
> > > 

Re: [ceph-users] Musings

2014-08-19 Thread Robert LeBlanc
Greg, thanks for the reply, please see in-line.


On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum  wrote:

>
> There are many groups running cluster >1PB, but whatever makes you
> comfortable. There is a bit more of a learning curve once you reach a
> certain scale than there is with smaller installations.
>

What do you find to be the most difficult issues at large scale? It may
help ease some of the concerns if we know what we can expect.

Yeah, there's no merging of Ceph clusters and I don't think there ever
> will be. Setting up the CRUSH maps this way to start, and only having
> a single entry for most of the levels, would work just fine though.
>

Thanks for confirming my suspicions. If we start with a CRUSH map designed
well, we can probably migrate the data outside of Ceph, grow one system, and as
the others empty, reformat them and bring them in.

Yeah, there is very little world Ceph experience with cache pools, and
> there's a lot working with an SSD journal + hard drive backing store;
> I'd start with that.
>

Other thoughts are using something like bcache or dm-cache on each OSD.
bcache is tempting because a single SSD device can serve multiple disks
whereas dm-cache has to have a separate SSD device/partition for each disk
(plus metadata). I plan on testing this unless someone says that it is
absolutely not worth the time.
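
A rough sketch of the bcache layout being considered (device names are
hypothetical; one SSD cache set fronting several spinning backing devices):

    # Format the SSD as a cache device and the spindles as backing devices
    make-bcache -C /dev/sdk
    make-bcache -B /dev/sdb /dev/sdc /dev/sdd
    # Attach each backing device (bcacheN) to the cache set by its UUID
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach
    # Writeback mode is what actually absorbs the write load
    echo writeback > /sys/block/bcache0/bcache/cache_mode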


> Yeah, no async replication at all for generic workloads. You can do
> the "2 my rack and one in a different rack" thing just fine, although
> it's a little tricky to set up. (There are email threads about this
> that hopefully you can find; I've been part of one of them.) The
> min_size is all about preserving a minimum resiliency of *every* write
> (if a PG's replication is degraded but not yet repaired); if you had a
> 2+1 setup then min_size of 2 would just make sure there are at least
> two copies somewhere (but not that they're in different racks or
> whatever).
>

The current discussion in the office is: if the cluster (2+1) is HEALTHY,
does the write return after 2 of the OSDs (the primary and one replica) complete
the write or only after all three have completed the write? We are planning
to try to do some testing on this as well if a clear answer can't be found.

Thank you,
Robert LeBlanc


Re: [ceph-users] Musings

2014-08-19 Thread Robert LeBlanc
Thanks, your responses have been helpful.


On Tue, Aug 19, 2014 at 1:48 PM, Gregory Farnum  wrote:

> On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc 
> wrote:
> > Greg, thanks for the reply, please see in-line.
> >
> >
> > On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum 
> wrote:
> >>
> >>
> >> There are many groups running cluster >1PB, but whatever makes you
> >> comfortable. There is a bit more of a learning curve once you reach a
> >> certain scale than there is with smaller installations.
> >
> >
> > What do you find to be the most difficult issues at large scale? It may
> help
> > ease some of the concerns if we know what we can expect.
>
> Well, I'm a developer, not a maintainer, so I'm probably the wrong
> person to ask about what surprises people. But in general it's stuff
> like:
> 1) Tunable settings matter more
> 2) Behavior that was unfortunate but left the cluster alive in a small
> cluster (eg, you have a bunch of slow OSDs that keep flapping) could
> turn into a data non-availability event in a large one (because with
> that many more OSDs misbehaving it overwhelms the monitors or
> something)
> 3) Resource consumption limits start popping up (eg, fd and pid limits
> need to be increased)
>
> Things like that. These are generally a matter of admin education at
> this scale (the code issues are fairly well sorted-out by now,
> although there were plenty of those to be found on the first
> multi-petabyte-scale cluster).
>
> >
> >> Yeah, there's no merging of Ceph clusters and I don't think there ever
> >> will be. Setting up the CRUSH maps this way to start, and only having
> >> a single entry for most of the levels, would work just fine though.
> >
> >
> > Thanks for confirming my suspicions. If we start with a CRUSH map
> designed
> > well, we can probably migrate the data outside of Ceph and just grow one
> > system and as the other empy, reformat them and bring them in.
> >
> >> Yeah, there is very little world Ceph experience with cache pools, and
> >> there's a lot working with an SSD journal + hard drive backing store;
> >> I'd start with that.
> >
> >
> > Other thoughts are using something like bcache or dm-cache on each OSD.
> > bcache is tempting because a single SSD device can serve multiple disks
> > where dm-cache has to have a separate SSD device/partition for each disk
> > (plus metadata). I plan on testing this unless someone says that it is
> > absolutely not worth the time.
> >
> >>
> >> Yeah, no async replication at all for generic workloads. You can do
> >> the "2 my rack and one in a different rack" thing just fine, although
> >> it's a little tricky to set up. (There are email threads about this
> >> that hopefully you can find; I've been part of one of them.) The
> >> min_size is all about preserving a minimum resiliency of *every* write
> >> (if a PG's replication is degraded but not yet repaired); if you had a
> >> 2+1 setup then min_size of 2 would just make sure there are at least
> >> two copies somewhere (but not that they're in different racks or
> >> whatever).
> >
> >
> > The current discussion in the office is if the cluster (2+1) is HEALTHY,
> > does the write return after 2 of the OSDs (itself and one replica)
> complete
> > the write or only after all three have completed the write? We are
> planning
> > to try to do some testing on this as well if a clear answer can't be
> found.
>
> It's only after all three have completed the write. Every write to
> Ceph is replicated synchronously to every OSD which is actively
> hosting the PG that the object resides in.
> -Greg
>


Re: [ceph-users] pool with cache pool and rbd export

2014-08-22 Thread Robert LeBlanc
My understanding is that all reads are copied to the cache pool. This would
indicate that cache will be evicted. I don't know to what extent this will
affect the hot cache because we have not used a cache pool yet. I'm
currently looking into bcache fronting the disks to provide caching there.

Robert LeBlanc


On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky 
wrote:

> Hello guys,
>
> I am planning to perform regular rbd pool off-site backup with rbd export
> and export-diff. I've got a small ceph firefly cluster with an active
> writeback cache pool made of couple of osds. I've got the following
> question which I hope the ceph community could answer:
>
> Will this rbd export or import operations affect the active hot data in
> the cache pool, thus evicting from the cache pool the real hot data used by
> the clients. Or does the process of rbd export/import effect only the osds
> and does not touch the cache pool?
>
> Many thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] pool with cache pool and rbd export

2014-08-22 Thread Robert LeBlanc
In the performance webinar last week (it is online for viewing), they
mentioned that they are looking at ways to prevent single reads from
entering the cache, among other optimizations. From what I understand it is still a
very new feature so I'm sure it will see some good improvements.
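
Until then, the main knobs for controlling how aggressively a cache tier
flushes and evicts are pool properties; a sketch with placeholder values and a
hypothetical pool name:

    ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
    ceph osd pool set cache-pool cache_target_full_ratio 0.8
    ceph osd pool set cache-pool cache_min_flush_age 600
    ceph osd pool set cache-pool cache_min_evict_age 1800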


On Fri, Aug 22, 2014 at 3:13 PM, Andrei Mikhailovsky 
wrote:

> So it looks like using rbd export / import will negatively effect the
> client performance, which is unfortunate. Is this really the case? Any
> plans on changing this behavior in future versions of ceph?
>
> Cheers
>
> Andrei
>
>
> ----- Original Message -
> From: "Robert LeBlanc" 
> To: "Andrei Mikhailovsky" 
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 22 August, 2014 8:21:08 PM
> Subject: Re: [ceph-users] pool with cache pool and rbd export
>
>
> My understanding is that all reads are copied to the cache pool. This
> would indicate that cache will be evicted. I don't know to what extent this
> will affect the hot cache because we have not used a cache pool yet. I'm
> currently looking into bcache fronting the disks to provide caching there.
>
>
> Robert LeBlanc
>
>
>
> On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com
> > wrote:
>
>
> Hello guys,
>
> I am planning to perform regular rbd pool off-site backup with rbd export
> and export-diff. I've got a small ceph firefly cluster with an active
> writeback cache pool made of couple of osds. I've got the following
> question which I hope the ceph community could answer:
>
> Will this rbd export or import operations affect the active hot data in
> the cache pool, thus evicting from the cache pool the real hot data used by
> the clients. Or does the process of rbd export/import effect only the osds
> and does not touch the cache pool?
>
> Many thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] pool with cache pool and rbd export

2014-08-22 Thread Robert LeBlanc
I believe the scrubbing happens at the pool level; when the backend pool is
scrubbed, it is independent of the cache pool. It would be nice to get some
definite answers from someone who knows a lot more.

Robert LeBlanc


On Fri, Aug 22, 2014 at 3:16 PM, Andrei Mikhailovsky 
wrote:

> Does that also mean that scrubbing and deep-scrubbing also squishes data
> out of the cache pool? Could someone from the ceph community confirm this?
>
> Thanks
>
>
> - Original Message -
> From: "Robert LeBlanc" 
> To: "Andrei Mikhailovsky" 
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 22 August, 2014 8:21:08 PM
> Subject: Re: [ceph-users] pool with cache pool and rbd export
>
>
> My understanding is that all reads are copied to the cache pool. This
> would indicate that cache will be evicted. I don't know to what extent this
> will affect the hot cache because we have not used a cache pool yet. I'm
> currently looking into bcache fronting the disks to provide caching there.
>
>
> Robert LeBlanc
>
>
>
> On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com
> > wrote:
>
>
> Hello guys,
>
> I am planning to perform regular rbd pool off-site backup with rbd export
> and export-diff. I've got a small ceph firefly cluster with an active
> writeback cache pool made of couple of osds. I've got the following
> question which I hope the ceph community could answer:
>
> Will this rbd export or import operations affect the active hot data in
> the cache pool, thus evicting from the cache pool the real hot data used by
> the clients. Or does the process of rbd export/import effect only the osds
> and does not touch the cache pool?
>
> Many thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


[ceph-users] Prioritize Heartbeat packets

2014-08-27 Thread Robert LeBlanc
I'm looking for a way to prioritize the heartbeat traffic higher than the
storage and replication traffic. I would like to keep the ceph.conf as
simple as possible by not adding the individual osd IP addresses and ports,
but it looks like the listening ports are pretty random. I'd like to use
iptables to mark the packet with DSCP, then our network team can mark it
with COS. I'm having a hard time figuring out what in the payload I can
match to say that it is a heartbeat packet.
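
If the heartbeat sockets could be identified, the marking itself would be
straightforward; a sketch of the intended rule, where the port range is a
placeholder (the real listening ports are effectively random today):

    # Hypothetical: mark traffic to a known heartbeat port range as DSCP CS6
    iptables -t mangle -A OUTPUT -p tcp -m multiport --dports 6800:7100 \
        -j DSCP --set-dscp-class CS6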

We have some concerns that under high network congestion scenarios,
heartbeats will be lost, marking OSDs down and triggering replication, which
causes more traffic that compounds the congestion, marking more OSDs
down until the entire cluster falls apart. Our intention is that we can try
to deliver heartbeat packets first with the hope that the system will know
that OSDs are up, just busy. Is our reasoning sound in this regard?

Thanks,
Robert LeBlanc


Re: [ceph-users] Prioritize Heartbeat packets

2014-08-27 Thread Robert LeBlanc
On Wed, Aug 27, 2014 at 4:15 PM, Sage Weil  wrote:

> On Wed, 27 Aug 2014, Robert LeBlanc wrote:
> > I'm looking for a way to prioritize the heartbeat traffic higher than the
> > storage and replication traffic. I would like to keep the ceph.conf as
> > simple as possible by not adding the individual osd IP addresses and
> ports,
> > but it looks like the listening ports are pretty random. I'd like to use
> > iptables to mark the packet with DSCP, then our network team can mark it
> > with COS. I'm having a hard time figuring out what in the payload I can
> > match to say that it is a heartbeat packet.
>
> What would be best way for us to mark which sockets are heartbeat related?
> Is there some setsockopt() type call we should be using, or should we
> perhaps use a different port range for heartbeat traffic?
>

I'm not really sure setting up a separate port range is the correct answer,
but it could be a very simple implementation: something like ports 6900+
would make for very simple iptables rules. Based on my packet capture, it
seems that heartbeat traffic is less than 150 bytes and the payload always
starts with 0x08, but I'm not sure how universally true that is. I could build an
iptables rule off of that.

> We have some concerns that under high network congestion scenarios,
> > heartbeats will be lost, marking OSDs down, triggering replication,
> causing
> > more traffic causing a compounding to the congestion, marking more OSDs
> down
> > until the entire cluster falls apart. Our intention is that we can try to
> > deliver heartbeat packets first with the hope that the system will know
> that
> > OSDs are up, just busy. Is our reasoning sound in this regard?
>
> It is reasonable.  We do need to be careful, however.  The heartbeats are
> attempting to measure whether the OSDs are able to effectively
> communicate.  Every time we try to give heartbeats a different level of
> service, we are diminishing their ability to measure that reality.  For
> example, if your QoS starved or nearly starved non-heartbeat traffic the
> OSDs wouldn't be able to tell.
>

From what I understand the heartbeat traffic is passed over both the public
and cluster networks now, so congestion on one network would not
necessarily cause an OSD to be marked down. Giving heartbeats higher
priority will actually confirm that OSDs can talk (albeit a bit slower)
instead of incorrectly assuming that they are dead and need to be taken
over.
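
For context, the split-network behavior and the timeouts involved come from a
few ceph.conf settings; a sketch with placeholder values:

    [global]
        public network  = 192.168.1.0/24
        cluster network = 192.168.2.0/24

    [osd]
        # how long an OSD can miss heartbeats before peers report it down
        osd heartbeat grace = 20
        osd heartbeat interval = 6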

> sage
>

Thanks,
Robert LeBlanc


Re: [ceph-users] Prioritize Heartbeat packets

2014-08-27 Thread Robert LeBlanc
Interesting concept. What if this was extended to an external message bus
system like RabbitMQ, ZeroMQ, etc?

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Aug 27, 2014 7:34 PM, "Matt W. Benjamin"  wrote:

> Hi,
>
> I wasn't thinking of an interface to mark sockets directly (didn't know
> one existed at the socket interface), rather something we might maintain,
> perhaps a query interface on the server, or perhaps DBUS, etc.
>
> Matt
>
> - "Sage Weil"  wrote:
>
> > On Wed, 27 Aug 2014, Matt W. Benjamin wrote:
> > >
> > > - "Sage Weil"  wrote:
> > > >
> > > > What would be best way for us to mark which sockets are heartbeat
> > > > related?
> > > > Is there some setsockopt() type call we should be using, or should
> > we
> > > >
> > > > perhaps use a different port range for heartbeat traffic?
> > >
> > > Would be be plausible to have hb messengers identify themselves to a
> > bus as such,
> > > that external tools (here, the ts scripts) could introspect?
> >
> > What do you mean by bus in this case?
> >
> > I seem to remember someone telling me there were hooks/hints you could
> >
> > call that would tag either a socket or possibly data on that socket
> > with a
> > label for use by iptables and such.. but I forget what it was.
> >
> > sage
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
>
> http://linuxbox.com
>
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Uneven OSD usage

2014-08-28 Thread Robert LeBlanc
How many PGs do you have in your pool? This should be about 100/OSD. If it
is too low, you could get an imbalance. I don't know the consequence of
changing it on such a full cluster. The default values are only good for
small test environments.
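
A sketch of checking and raising the PG count for the cluster described below
(12 equal OSDs, 2 replicas), using the ~100 PGs per OSD rule of thumb; the pool
name is an assumption, and note that splitting PGs moves data, so it should be
done cautiously on a nearly full cluster:

    ceph osd pool get rbd pg_num
    # 12 OSDs * 100 / 2 replicas = 600; round to a power of two (512 or 1024)
    ceph osd pool set rbd pg_num 1024
    ceph osd pool set rbd pgp_num 1024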

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Aug 28, 2014 11:00 AM, "J David"  wrote:

> Hello,
>
> Is there any way to provoke a ceph cluster to level out its OSD usage?
>
> Currently, a cluster of 3 servers with 4 identical OSDs each is
> showing disparity of about 20% between the most-used OSD and the
> least-used OSD.  This wouldn't be too big of a problem, but the
> most-used OSD is now at 86% (with the least-used at 72%).
>
> There are three more nodes on order but they are a couple of weeks
> away.  Is there anything I can do in the mean time to push existing
> data (and new data) toward less-used OSD's?
>
> Reweighting the OSD's feels intuitively like the wrong approach since
> they are all the same size and "should" have the same weight.  Is that
> the wrong intuition?
>
> Also, with a test cluster, I did try playing around with
> reweight-by-utilization and it actually seemed to make things worse.
> But that cluster was assembled from spare parts and the OSD's were
> neither all the same size nor were they uniformly distributed between
> servers.  This is *not* a test cluster, so I am gun-shy about possibly
> making things worse.
>
> Is reweight-by-utilization the right point to poke this?  Or is there
> a better tool in the toolbox for this situation?
>
> Here is the OSD tree showing that everything is weighted equally:
>
> # id weight type name up/down reweight
> -1 4.2 root default
> -2 1.4 host f13
> 0 0.35 osd.0 up 1
> 1 0.35 osd.1 up 1
> 2 0.35 osd.2 up 1
> 3 0.35 osd.3 up 1
> -3 1.4 host f14
> 4 0.35 osd.4 up 1
> 9 0.35 osd.9 up 1
> 10 0.35 osd.10 up 1
> 11 0.35 osd.11 up 1
> -4 1.4 host f15
> 5 0.35 osd.5 up 1
> 6 0.35 osd.6 up 1
> 7 0.35 osd.7 up 1
> 8 0.35 osd.8 up 1
>
> And the df's of each:
>
> Node 1:
>
> /dev/sda2   358G  258G
> 101G  72% /var/lib/ceph/osd/ceph-0
> /dev/sdb2   358G  294G
> 65G  82% /var/lib/ceph/osd/ceph-1
> /dev/sdc2   358G  278G
> 81G  78% /var/lib/ceph/osd/ceph-2
> /dev/sdd2   358G  294G
> 65G  83% /var/lib/ceph/osd/ceph-3
>
> Node 2:
>
> /dev/sda2   358G  285G
> 73G  80% /var/lib/ceph/osd/ceph-5
> /dev/sdb2   358G  305G
> 53G  86% /var/lib/ceph/osd/ceph-6
> /dev/sdc2   358G  301G
> 58G  85% /var/lib/ceph/osd/ceph-7
> /dev/sdd2   358G  299G
> 60G  84% /var/lib/ceph/osd/ceph-8
>
> Node 3:
>
> /dev/sda2   358G  290G
> 68G  82% /var/lib/ceph/osd/ceph-4
> /dev/sdb2   358G  297G
> 62G  83% /var/lib/ceph/osd/ceph-11
> /dev/sdc2   358G  285G
> 73G  80% /var/lib/ceph/osd/ceph-10
> /dev/sdd2   358G  306G
> 53G  86% /var/lib/ceph/osd/ceph-9
>
> Ideally we would like to get about 125 gigs more data (with num of
> replicas set to 2) onto this pool before the additional nodes arrive,
> which would put *everything* at about 86% if everything were evenly
> balanced.  But the way it's currently going, that'll have the busiest
> OSD dangerously close to 95%.  (Apparently data increases faster than
> you expect, even if you account for this. :-P )
>
> What's the best way forward?
>
> Thanks for any advice!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Questions regarding Crush Map

2014-09-02 Thread Robert LeBlanc
According to http://ceph.com/docs/master/rados/operations/crush-map/, you
should be able to construct a clever use of 'step take' and 'step choose'
rules in your CRUSH map to force one copy to a particular bucket and allow
the other two copies to be chosen elsewhere. I was looking for a way to
have some locality like one copy in the current rack, one copy in the
current row and a third copy somewhere else in the data center. I don't
think there is an easy way to figure that out since the clients don't map
to the CRUSH map.
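
A sketch of the 'step take'/'step choose' idea for pinning one copy; the rule
and bucket names are hypothetical and this is untested. Note that the copies
chosen from the wider root can still land in the pinned rack unless the
hierarchy keeps them apart:

    rule one_copy_in_rack3 {
        ruleset 3
        type replicated
        min_size 2
        max_size 3
        # force one replica into a specific rack
        step take rack3
        step chooseleaf firstn 1 type host
        step emit
        # choose the remaining replicas from the whole tree, one per rack
        step take default
        step chooseleaf firstn -1 type rack
        step emit
    }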


On Tue, Sep 2, 2014 at 9:54 AM, Jakes John  wrote:

> Thanks Loic.
>
>
>
> On Mon, Sep 1, 2014 at 11:31 PM, Loic Dachary  wrote:
>
>> Hi John,
>>
>> On 02/09/2014 05:29, Jakes John wrote:> Hi,
>> >I have some general questions regarding the crush map. It would be
>> helpful if someone can help me out by clarifying them.
>> >
>> > 1.  I saw that a bucket 'host' is always created for the crush maps
>> which are automatically generated by ceph. If I am manually creating
>> crushmap,  do I need to always add a bucket called ' host' ? As I was
>> looking through the source code, I didn't see any need for this. If not
>> necessary, can osd's of the same host be split into mulitple buckets?
>> >
>> > eg : Say host 1 has four osd's- osd.0,osd.1,osd.2, osd.3
>> > host 2 has four osd's-
>> osd.4,osd.5,osd.6,osd.7
>> >
>> > and create two buckets -
>> >
>>
> > HostGroup bucket1- {osd.0, osd.1,osd.4,osd.5}
>> > HostGroup bucket2-{osd.2,osd.3,osd.6,osd.7} where HostGroup is new
>> bucket type instead of the default 'host' type.
>> >
>> >
>> > Is this configuration possible or invalid? If this is possible, I can
>> group SSD's of all hosts into 1 bucket and HDD's into other.
>>
>> What you describe seem possible but I'm not sure what problem you are
>> trying to solve. The crush map described at
>>
>>
>> http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds
>>
>> is not what you want ?
>>
>
> No. I have seen this before. In the example, each of the hosts had either
> only SSD's or only HDD's. In my example, host1 has two ssd's and two hdd's.
>  Similar is the case for host2. So, If I need to have a bucket for SSD's
> only, I need to create a bucket as stated above.
>
> HostGroup bucket1- {osd.0, osd.1,osd.4,osd.5}
>
> Is this possible?
>
>>
>> > 2. I have read in Ceph docs that same osd is not advised to be part of
>> two buckets(two pools).
>>
>> A single OSD should be in a single bucket in the crush map indeed. But it
>> is common for two OSD to be part of multiple pools. The pools are
>> associated with a ruleset and each ruleset can choose in the same set of
>> OSDs.
>>
>> > Is there any reason for it? But,I couldn't find this limitation in the
>> source code.
>>
>> There is no limitation in the code but the crush function has been tested
>> and used with a hierarchy where leaf nodes are not part of more than one
>> bucket.
>>
>>
>
>
>> Cheers
>>
>> > eg:osd.0 is in bucket1 and bucket2.
>> >
>> > Is this configuration possible or invalid? If this is possible, I have
>> the flexibility to have group data which are written to different pools.
>>
>
> So, it is possible right?. I have plans to have third replica to stay in a
> particular rack for all pools.( common to all pools)
>
>
>> >
>> > 3. Is it possible to exclude or include a particular osd/host/rack in
>> the crush mapping?.
>> >
>> > eg: I need to have third replica always in rack3 (a specified
>> row/rack/host based on requirements) . First two can be chosen randomly
>> >
>> > If possible, how can I configure it?
>> >
>>
>
>  Any ideas for 3,4,5 ?
>
>>  >
>> > 4. It is said that osd weights must be configured based on the storage.
>> Say if I have SSD of 512 GB and HDD of 1 TB and if I configure .5 and 1
>> respectively, am I treating both SSD and HDD equally? How do I prioritize
>> SSD over HDD?
>> >
>>
>
>
>
>> > 5. Continuing from 4), If i have mix of SSD's and HDD's in the  same
>> host, what are the best ways possible to utilize the SSD capabilities in
>> the ceph cluster?
>> >
>> >
>>
>
>
>
>> > Looking forward to your help,
>> >
>> > Thanks,
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Robert LeBlanc
We are still pretty early on in our testing of how to best use SSDs as
well. What we are trying right now, for some of the reasons you mentioned
already, is to use bcache as a cache for both journal and data. We have 10
spindles in our boxes with 2 SSDs. We created two bcaches (one for each
SSD) and put five spindles behind it, with the journals as just files on the
bcached spindles (because the journal is hot, it should stay in the SSD). This
should have the advantage that if the SSD fails, it could automatically fall
back to write-through mode (although I don't think it will help if the SSD suddenly
fails). However, it seems that if any part of the journal is lost, the OSD
is toast and needs to be rebuilt. Bcache was appealing to us because one
SSD could front multiple backend disks and make the most efficient use of
the SSD, it also has write around for large sequential writes so that cache
is not evicted for large sequential writes which spindles are good at.
Since we have a high read cache hit from KVM and other layers, this is
primary intended to help accelerate writes more than reads (we are also
more write heavy in our environment).

So far it seems to help, but we are going to start more in-depth testing
soon. One drawback is that bcache devices don't seem to like partitions, so
we have created the OSDs manually instead of using ceph-deploy.
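
For reference, a condensed sketch of that manual bring-up (IDs, weights, and
the bcache device name are illustrative):

    OSD_ID=$(ceph osd create)
    mkfs.xfs /dev/bcache0
    mkdir -p /var/lib/ceph/osd/ceph-$OSD_ID
    mount /dev/bcache0 /var/lib/ceph/osd/ceph-$OSD_ID
    ceph-osd -i $OSD_ID --mkfs --mkkey
    ceph auth add osd.$OSD_ID osd 'allow *' mon 'allow profile osd' \
        -i /var/lib/ceph/osd/ceph-$OSD_ID/keyring
    ceph osd crush add osd.$OSD_ID 1.0 host=$(hostname -s)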

I too am interested with other's experience with SSD and trying to
cache/accelerate Ceph. I think the Caching pool in the long run will be the
best option, but it can still use some performance tweaking with small
reads before it will be really viable for us.

Robert LeBlanc


On Thu, Sep 4, 2014 at 10:21 AM, Dan Van Der Ster  wrote:

> Dear Cephalopods,
>
> In a few weeks we will receive a batch of 200GB Intel DC S3700’s to
> augment our cluster, and I’d like to hear your practical experience and
> discuss options how best to deploy these.
>
> We’ll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
> they will become 20 OSDs + 4 SSDs per server. Until recently I’ve been
> planning to use the traditional deployment: 5 journal partitions per SSD.
> But as SSD-day approaches, I growing less comfortable with the idea of 5
> OSDs going down every time an SSD fails, so perhaps there are better
> options out there.
>
> Before getting into options, I’m curious about real reliability of these
> drives:
>
> 1) How often are DC S3700's failing in your deployments?
> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
> backfilling which results from an SSD failure? Have you considered tricks
> like increasing the down out interval so backfilling doesn’t happen in this
> case (leaving time for the SSD to be replaced)?
>
> Beyond the usually 5 partitions deployment, is anyone running a RAID1 or
> RAID10 for the journals? If so, are you using the raw block devices or
> formatting it and storing the journals as files on the SSD array(s)? Recent
> discussions seem to indicate that XFS is just as fast as the block dev,
> since these drives are so fast.
>
> Next, I wonder how people with puppet/chef/… are handling the
> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
> all the dependent OSDs completely when the journal dev fails? I’m not keen
> on puppetizing the re-creation of journals for OSDs...
>
> We also have this crazy idea of failing over to a local journal file in
> case an SSD fails. In this model, when an SSD fails we’d quickly create a
> new journal either on another SSD or on the local OSD filesystem, then
> restart the OSDs before backfilling started. Thoughts?
>
> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
> other 2 SSDs to hold 20 journals — probably in a RAID1 to avoid backfilling
> 10 OSDs when an SSD fails). If the 10-1 ratio of SSDs would perform
> adequately, that’d give us quite a few SSDs to build a dedicated high-IOPS
> pool.
>
> I’d also appreciate any other suggestions/experiences which might be
> relevant.
>
> Thanks!
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Robert LeBlanc
So far it has worked really well; we can raise/lower/disable/enable the
cache in realtime and watch how the load and traffic changes. There have
been some positive subjective results, but definitive results are still
forthcoming. bcache on CentOS 7 was not easy, which makes me wish we were
running Debian or Ubuntu. If there are enough reasons to train our admins
on Debian/Ubuntu in addition to learning CentOS7 for customer facing boxes,
we may move that way for Ceph and OpenStack, but I'm not sure how Red Hat
purchasing Inktank will shift the development from Debian/Ubuntu, so we
don't want to make any big changes until we have a better idea of what the
future looks like. I think the Enterprise versions of Ceph (n-1 or n-2)
will be a bit older than where we want to be; I'm sure they will work
wonderfully on Red Hat, but how well will n.1, n.2, or n.3 run?

Robert LeBlanc


On Thu, Sep 4, 2014 at 11:22 AM, Dan Van Der Ster  wrote:

>  Hi Robert,
>
> That's actually a pretty good idea, since bcache would also accelerate the
> filestore flushes and leveldb. I actually wonder if an SSD-only pool would
> even be faster than such a setup... probably not.
>
> We're using an ancient enterprise n distro, so it will be a bit of a
> headache to get the right kernel, etc .. But my colleague is planning to
> use bcache to accelerate our hypervisors' ephemeral storage, so I guess
> that's a solved problem.
>
> Hmm...
>
> Thanks!
>
> Dan
> On Sep 4, 2014 6:42 PM, Robert LeBlanc  wrote:
>  We are still pretty early on in our testing of how to best use SSDs as
> well. What we are trying right now, for some of the reasons you mentioned
> already, is to use bcache as a cache for both journal and data. We have 10
> spindles in our boxes with 2 SSDs. We created two bcaches (one for each
> SSD) and put five spindles behind it with the journals as just files on the
> spindle (because it is hot, it should stay in SSD). This should have the
> advantage that if the SSD fails, it could automatically fail to
> write-through mode (although I don't think it will help if the SSD suddenly
> fails). However, it seems that if any part of the journal is lost, the OSD
> is toast and needs to be rebuilt. Bcache was appealing to us because one
> SSD could front multiple backend disks and make the most efficient use of
> the SSD, it also has write around for large sequential writes so that cache
> is not evicted for large sequential writes which spindles are good at.
> Since we have a high read cache hit from KVM and other layers, this is
> primary intended to help accelerate writes more than reads (we are also
> more write heavy in our environment).
>
>  So far it seems to help, but we are going to start more in-depth testing
> soon. One drawback is that bcache devices don't seem to like partitions, so
> we have created the OSDs manually instead if using ceph-deploy.
>
>  I too am interested with other's experience with SSD and trying to
> cache/accelerate Ceph. I think the Caching pool in the long run will be the
> best option, but it can still use some performance tweaking with small
> reads before it will be really viable for us.
>
>  Robert LeBlanc
>
>
> On Thu, Sep 4, 2014 at 10:21 AM, Dan Van Der Ster <
> daniel.vanders...@cern.ch> wrote:
>
>> Dear Cephalopods,
>>
>> In a few weeks we will receive a batch of 200GB Intel DC S3700’s to
>> augment our cluster, and I’d like to hear your practical experience and
>> discuss options how best to deploy these.
>>
>> We’ll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
>> they will become 20 OSDs + 4 SSDs per server. Until recently I’ve been
>> planning to use the traditional deployment: 5 journal partitions per SSD.
>> But as SSD-day approaches, I growing less comfortable with the idea of 5
>> OSDs going down every time an SSD fails, so perhaps there are better
>> options out there.
>>
>> Before getting into options, I’m curious about real reliability of these
>> drives:
>>
>> 1) How often are DC S3700's failing in your deployments?
>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
>> backfilling which results from an SSD failure? Have you considered tricks
>> like increasing the down out interval so backfilling doesn’t happen in this
>> case (leaving time for the SSD to be replaced)?
>>
>> Beyond the usually 5 partitions deployment, is anyone running a RAID1 or
>> RAID10 for the journals? If so, are you using the raw block devices or
>> formatting it and storing the journals as files on the SSD array(s)? Recent
>> discussions seem to indicate that XFS is just as fast as the block dev,
>> since 

Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Robert LeBlanc
You should be able to use any block device in a bcache device. Right now,
we are OK losing one SSD and it takes out 5 OSDs. We would rather have
twice the cache. Our opinion may change in the future. We wanted to keep
overhead as low as possible. I think we may spend the extra on heavier-duty
SSDs: there is less overhead in not having a mirror, and fewer SSD drives allow us
to put more high-capacity spindles in each host (we have a density
need). We still have to test how much cache is optimal. I'm not sure how
much 1% SSD will help and if 2% will make any difference, etc. That's why
we need to test.

My reasoning is that if we can buffer the writes, hopefully we can write to
the spindles in a more linear manner and give reads a better chance of
getting serviced faster. My theory is that with all the caching already
happening at KVM, etc that 1% writeback will be much more useful than 1%
read cache because the reads at the OSD level will be cache misses anyway
because they will be cold pages.

Write-behind is really our target: reads can be serviced from cache a good
portion of the time, but writes always have to hit a disk (our current disk
system is about 33% read and 66% writes), in the Ceph case (and in our
config) three disks. When you add the latency from all the levels, network,
kernel, buffers, disk, etc it adds up to a lot. If you are always doing
large transfers, then it won't be too noticeable because bandwidth helps
outweigh the latency. But when we are dealing with thousands of VMs all
doing very small things like writing a couple of lines to a log,
reading/writing some database pages, etc, the latency just kills the
performance in a big way. Reads aren't too bad because only the primary OSD
has to service the request, but on writes all three OSDs have to
acknowledge the write. So I'm trying to get the absolute best write
performance as I can. If I can reduce a 10 millisecond write to disk to 1
ms, then I've saved about 18 ms in the transaction (primary OSD write plus
parallel write to secondary OSDs presumably).

With bcache, I'd even like to get rid of the journal writes in a test case.
Since it is hitting SSD and then being serialized to the backing disk
anyway by bcache, it seems that the journal is just a double write penalty.
Then it will be time to tackle the network latency, can't wait to get our
Infiniband gear to test AccelIO with Ceph.

I'm interested to see what you decide to do and what your results are.


On Thu, Sep 4, 2014 at 12:12 PM, Dan Van Der Ster  wrote:

>  I've just been reading the bcache docs. It's a pity the mirrored writes
> aren't implemented yet. Do you know if you can use an md RAID1 as a cache
> dev? And is the graceful failover from wb to writethrough actually working
> without data loss?
>
> Also, write behind sure would help the filestore, since I'm pretty sure
> the same 4k blocks are being overwritten many times (from our RBD clients).
>
> Cheers, Dan
>  On Sep 4, 2014 7:44 PM, Robert LeBlanc  wrote:
>  So far it was worked really well, we can raise/lower/disable/enable the
> cache in realtime and watch how the load and traffic changes. There has
> been some positive subjective results, but definitive results are still
> forth coming. bcache on CentOS 7 was not easy, makes me wish we were
> running Debian or Ubuntu. If there are enough reasons to train our admins
> on Debian/Ubuntu in addition to learning CentOS7 for customer facing boxes,
> we may move that way for Ceph and OpenStack, but I'm not sure how Red Hat
> purchasing Inktank will shift the development from Debian/Ubuntu, so we
> don't want to make any big changes until we have a better idea of what the
> future looks like. I think the Enterprise versions of Ceph (n-1 or n-2)
> will be a bit too old from where we want to be, which I'm sure will work
> wonderfully on Red Hat, but how will n.1, n.2 or n.3 run?
>
>  Robert LeBlanc
>
>
> On Thu, Sep 4, 2014 at 11:22 AM, Dan Van Der Ster <
> daniel.vanders...@cern.ch> wrote:
>
>>  Hi Robert,
>>
>> That's actually a pretty good idea, since bcache would also accelerate
>> the filestore flushes and leveldb. I actually wonder if an SSD-only pool
>> would even be faster than such a setup... probably not.
>>
>> We're using an ancient enterprise n distro, so it will be a bit of a
>> headache to get the right kernel, etc .. But my colleague is planning to
>> use bcache to accelerate our hypervisors' ephemeral storage, so I guess
>> that's a solved problem.
>>
>> Hmm...
>>
>> Thanks!
>>
>> Dan
>>   On Sep 4, 2014 6:42 PM, Robert LeBlanc  wrote:
>>  We are still pretty early on in our testing of how to best use SSDs as
>> well. What we are trying right now, for

Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Robert LeBlanc
This is good to know. I just recompiled the CentOS 7 3.10 kernel to enable
bcache (I doubt they patched bcache, since they don't compile/enable it).
When I ran Ceph in VMs on my workstation I did see oopses
with bcache, but putting the bcache device and the backing device, even with
two copies, on a single 7200 RPM disk would bring my machine to its knees. I
think bcache (probably without the patches) will have issues with disks
that don't respond fast enough, but we have yet to have any issues on our
test cluster of 40 OSDs. We also haven't really pushed it yet though.
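
A quick way to check whether a given kernel build has bcache available at all
(a sketch; whether the stock distro config enables it is the open question):

    grep BCACHE /boot/config-$(uname -r)
    # expect CONFIG_BCACHE=m (or =y); then the module should load
    modprobe bcache
    ls /sys/fs/bcache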

Are the patches you talk about just backports from later kernels or
something different?

Robert LeBlanc


On Thu, Sep 4, 2014 at 1:13 PM, Stefan Priebe  wrote:

> Hi Dan, hi Robert,
>
> Am 04.09.2014 21:09, schrieb Dan van der Ster:
>
>  Thanks again for all of your input. I agree with your assessment -- in
>> our cluster we avg <3ms for a random (hot) 4k read already, but > 40ms
>> for a 4k write. That's why we're adding the SSDs -- you just can't run a
>> proportioned RBD service without them.
>>
>
> How did you measure these latencies?
>
>
>  I'll definitely give bcache a try in my test setup, but more reading has
>> kinda tempered my expectations -- the rate of oopses and hangs on the
>> bcache ML seems a bit high. And a 3.14 kernel would indeed still be a
>> challenge on our RHEL6 boxes.
>>
>
> bcache works fine with 3.10 and a bunch of patches ;-) Not sure if you can
> upgrade to RHEL7 and also not sure if RHEL has already some of them ready.
>
> We're using bcache on one of our bcache clusters since more than year
> based on kernel 3.10 + 15 patches and i never saw a crash or hang since
> applying them ;-) But yes with a vanilla kernel it's not that stable.
>
> Greets,
> Stefan
>
>  Cheers, Dan
>>
>> September 4 2014 8:47 PM, "Robert LeBlanc" > <mailto:%22Robert%20LeBlanc%22%20>> wrote:
>>
>> You should be able to use any block device in a bcache device. Right
>> now, we are OK losing one SSD and it takes out 5 OSDs. We would
>> rather have twice the cache. Our opinion may change in the future.
>> We wanted to keep as much overhead as low as possible. I think we
>> may spend the extra on heavier duty SSDs; less overhead not having a
>> mirror, and less SSD drives allow us to put in more high capacity
>> spindles in each host (we have a density need). We still have to
>> test how much cache is optimal. I'm not sure how much 1% SSD will
>> help and if 2% will make any difference, etc. That's why we need to
>> test.
>> My reasoning is that if we can buffer the writes, hopefully we can
>> write to the spindles in a more linearly manner and give reads a
>> better chance of getting serviced faster. My theory is that with all
>> the caching already happening at KVM, etc that 1% writeback will be
>> much more useful than 1% read cache because the reads at the OSD
>> level will be cache misses anyway because they will be cold pages.
>> Write behind is really our target, reads can be serviced from cache
>> a good portion of the time, but writes have to always hit a disk
>> (our current disk system is about 33% read and 66% writes), in the
>> Ceph case (and in our config) three disks. When you add the latency
>> from all the levels, network, kernel, buffers, disk, etc it adds up
>> to a lot. If you are always doing large transfers, then it won't be
>> too noticeable because bandwidth helps outweigh the latency. But
>> when we are dealing with thousands of VMs all doing very small
>> things like writing a couple of lines to a log, reading/writing some
>> database pages, etc, the latency just kills the performance in a big
>> way. Reads aren't too bad because only the primary OSD has to
>> service the request, but on writes all three OSDs have to
>> acknowledge the write. So I'm trying to get the absolute best write
>> performance as I can. If I can reduce a 10 millisecond write to disk
>> to 1 ms, then I've saved about 18 ms in the transaction (primary OSD
>> write plus parallel write to secondary OSDs presumably).
>> With bcache, I'd even like to get rid of the journal writes in a
>> test case. Since it is hitting SSD and then being serialized to the
>> backing disk anyway by bcache, it seems that the journal is just a
>> double write penalty. Then it will be time to tackle the network
>> latency, can't wait

Re: [ceph-users] Bcache / Enhanceio with osds

2014-09-22 Thread Robert LeBlanc
We are still in the middle of testing things, but so far we have seen more
improvement from SSD journals than from OSDs cached with bcache (five OSDs
fronted by one SSD). We have yet to test whether adding a bcache layer in
addition to the SSD journals provides any additional improvement.

Robert LeBlanc

On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson 
wrote:

> On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote:
>
>> Hello guys,
>>
>> Was wondering if anyone uses or done some testing with using bcache or
>> enhanceio caching in front of ceph osds?
>>
>> I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds
>> for journals. I've recently purchased four additional ssds to be used
>> for ceph cache pool, but i've found performance of guest vms to be
>> slower with the cache pool for many benchmarks. The write performance
>> has slightly improved, but the read performance has suffered a lot (as
>> much as 60% in some tests).
>>
>> Therefore, I am planning to scrap the cache pool (at least until it
>> matures) and use either bcache or enhanceio instead.
>>
>
> We're actually looking at dm-cache a bit right now. (and talking to some of
> the developers about the challenges they are facing to help improve our own
> cache tiering)  No meaningful benchmarks of dm-cache yet though. Bcache,
> enhanceio, and flashcache all look interesting too.  Regarding the cache
> pool: we've got a couple of ideas that should help improve performance,
> especially for reads.  There are definitely advantages to keeping cache
> local to the node though.  I think some form of local node caching could be
> pretty useful going forward.
>
>
>> Thanks
>>
>> Andrei
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cold-storage tuning Ceph

2015-02-23 Thread Robert LeBlanc
Sorry this is delayed; I'm catching up. I believe this was discussed at
the last Ceph summit. I think this was the blueprint:
https://wiki.ceph.com/Planning/Blueprints/Hammer/Towards_Ceph_Cold_Storage

On Wed, Jan 14, 2015 at 9:35 AM, Martin Millnert  wrote:
> Hello list,
>
> I'm currently trying to understand what I can do with Ceph to optimize
> it for a cold-storage (write-once, read-very-rarely) like scenario,
> trying to compare cost against LTO-6 tape.
>
> There is a single main objective:
>  - minimal cost/GB/month of operations (including power, DC)
>
> To achieve this, I can break it down to:
>  - Use best cost/GB HDD
>* SMR today
>  - Minimal cost/3.5"-slot
>  - Minimal power-utilization/drive
>
> While staying within what is available today, I don't imagine going to
> power-down individual disk slots using IPMI etc, as some vendors allow.
>
> Now, putting Ceph on this, drives will be on, but it would be very
> useful to be able to spin-down drives that aren't used.
>
> It then seems to me that I want to do a few things with Ceph:
>  - Have only a subset of the cluster 'active' for writes at any point in
>time
>  - Yet, still have the entire cluster online and available for reads
>  - Minimize concurrent OSD operations in a node that uses RAM, e.g.
>- Scrubbing, minimal number of OSDs active (ideally max 1)
>- In general, minimize concurrent "active" OSDs as per above
>  - Minimize risk that any type of re-balancing of data occurs at all
>- E.g. use a "high" number of EC parity chunks
>
>
> Assuming e.g. 16 drives/host and 10TB drives, at ~100MB/s read and
> nearly full cluster, deep scrubbing the host would take 18.5 days.
> This means roughly 2 deep scrubs per month.
> Using EC pool, I wouldn't be very worried about errors, so perhaps
> that's ok (calculable), but they need to be repaired obviously.
> Mathematically, I can use an increase of parity chunks to lengthen the
> interval between deep scrubs.
>
>
> Is there anyone on the list who can provide some thoughts on the
> higher-order goal of "Minimizing concurrently active OSDs in a node"?
>
> I imagine I need to steer writes towards a subset of the system - but I
> have no idea how to implement it - using multiple separate clusters eg.
> each OSD on a node participate in unique clusters could perhaps help.
>
> Any feedback appreciated.  It does appear a hot topic (pun intended).
>
> Best,
> Martin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Startup Best Practice: gpt/udev or SysVInit/systemd ?

2015-02-24 Thread Robert LeBlanc
We have had good luck with letting udev do its thing on CentOS 7.

On Wed, Feb 18, 2015 at 7:46 PM, Anthony Alba  wrote:
> Hi Cephers,
>
> What is your "best practice" for starting up OSDs?
>
> I am trying to determine the most robust technique on CentOS 7 where I
> have too much choice:
>
> udev/gpt/uuid or /etc/init.d/ceph or /etc/systemd/system/ceph-osd@X
>
> 1. Use udev/gpt/UUID: no OSD  sections in  /etc/ceph/mycluster.conf or
> premounts in /etc/fstab.
> Let udev + ceph-disk-activate do its magic.
>
> 2. Use /etc/init.d/ceph start osd or systemctl start ceph-osd@N
> a. do you change partition UUID so no udev kicks in?
> b. do you keep  [osd.N] sections in /etc/ceph/mycluster.conf
> c. premount all journals/OSDs in /etc/fstab?
>
> The problem with this approach, though very explicit and robust, is
> that it is is hard to maintain
> /etc/fstab on the OSD hosts.
>
> - Anthony
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Robert LeBlanc
We use ceph-disk without any issues on CentOS 7. If you want to do a
manual deployment, verify you aren't missing any steps in
http://ceph.com/docs/master/install/manual-deployment/#long-form.


On Tue, Feb 24, 2015 at 5:46 PM, Barclay Jameson
 wrote:
> I have tried to install ceph using ceph-deploy but sgdisk seems to
> have too many issues so I did a manual install. After mkfs.btrfs on
> the disks and journals and mounted them I then tried to start the osds
> which failed. The first error was:
> #/etc/init.d/ceph start osd.0
> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
> /var/lib/ceph defines )
>
> I then manually added the osds to the conf file with the following as
> an example:
> [osd.0]
> osd_host = node01
>
> Now when I run the command :
> # /etc/init.d/ceph start osd.0
>
> There is no error or output from the command and in fact when I do a
> ceph -s no osds are listed as being up.
> Doing as ps aux | grep -i ceph or ps aux | grep -i osd shows there are
> no osd running.
> I also have done htop to see if any process are running and none are shown.
>
> I had this working on SL6.5 with Firefly but Giant on Centos 7 has
> been nothing but a giant pain.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using radosgw with civetweb?

2015-02-25 Thread Robert LeBlanc
We tried to get radosgw working with Apache + mod_fastcgi, but due to
the changes in radosgw, Apache, mod_*cgi, etc. and the documentation
lagging and not having a lot of time to devote to it, we abandoned it.
Where is the documentation for civetweb? If it is appliance-like and
easy to set up, we would like to try it to offer some feedback on your
question.

Thanks,
Robert LeBlanc

On Wed, Feb 25, 2015 at 12:31 PM, Sage Weil  wrote:
> Hey,
>
> We are considering switching to civetweb (the embedded/standalone rgw web
> server) as the primary supported RGW frontend instead of the current
> apache + mod-fastcgi or mod-proxy-fcgi approach.  "Supported" here means
> both the primary platform the upstream development focuses on and what the
> downstream Red Hat product will officially support.
>
> How many people are using RGW standalone using the embedded civetweb
> server instead of apache?  In production?  At what scale?  What
> version(s) (civetweb first appeared in firefly and we've backported most
> fixes).
>
> Have you seen any problems?  Any other feedback?  The hope is to (vastly)
> simplify deployment.
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Robert LeBlanc
I think that your problem lies with systemd (even though you are using
SysV syntax, systemd is really doing the work). Systemd does not like
multiple arguments and I think this is why it is failing. There is
supposed to be some work done to get systemd working ok, but I think
it has the limitation of only working with a cluster named 'ceph'
currently.

What I did to get around the problem was to run the osd command manually:

ceph-osd -i {osd-num}

Once I understood the under-the-hood stuff, I moved to ceph-disk and
now, because of the GPT partition IDs, udev automatically starts up the
OSD process at boot/creation and moves it to the appropriate CRUSH
location (configurable in ceph.conf
http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
an example: crush location = host=test rack=rack3 row=row8
datacenter=local region=na-west root=default). To restart an OSD
process, I just kill the PID for the OSD then issue ceph-disk activate
/dev/sdx1 to restart the OSD process. You probably could stop it with
systemctl since I believe udev creates a resource for it (I should
probably look into that now that this system will be going production
soon).

On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson  wrote:
> I'm having a similar issue.
>
> I'm following http://ceph.com/docs/master/install/manual-deployment/ to a T.
>
> I have OSDs on the same host deployed with the short-form and they work
> fine. I am trying to deploy some more via the long form (because I want them
> to appear in a different location in the crush map). Everything through step
> 10 (i.e. ceph osd crush add {id-or-name} {weight}
> [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step 11
> (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
> /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines mon.hobbit01
> osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6
> osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7 osd.15
> osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 osd.5
> osd.4 osd.0)
>
>
>
> On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden  wrote:
>>
>> Also, did you successfully start your monitor(s), and define/create the
>> OSDs within the Ceph cluster itself?
>>
>> There are several steps to creating a Ceph cluster manually.  I'm unsure
>> if you have done the steps to actually create and register the OSDs with the
>> cluster.
>>
>>  - Travis
>>
>> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master  wrote:
>>>
>>> Check firewall rules and selinux. It sometimes is a pain in the ... :)
>>>
>>> 25 Feb 2015 01:46 "Barclay Jameson"  wrote:
>>>
 I have tried to install ceph using ceph-deploy but sgdisk seems to
 have too many issues so I did a manual install. After mkfs.btrfs on
 the disks and journals and mounted them I then tried to start the osds
 which failed. The first error was:
 #/etc/init.d/ceph start osd.0
 /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
 /var/lib/ceph defines )

 I then manually added the osds to the conf file with the following as
 an example:
 [osd.0]
 osd_host = node01

 Now when I run the command :
 # /etc/init.d/ceph start osd.0

 There is no error or output from the command and in fact when I do a
 ceph -s no osds are listed as being up.
 Doing as ps aux | grep -i ceph or ps aux | grep -i osd shows there are
 no osd running.
 I also have done htop to see if any process are running and none are
 shown.

 I had this working on SL6.5 with Firefly but Giant on Centos 7 has
 been nothing but a giant pain.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using radosgw with civetweb?

2015-02-25 Thread Robert LeBlanc
Cool, I'll see if we have some cycles to look at it.

On Wed, Feb 25, 2015 at 2:49 PM, Sage Weil  wrote:
> On Wed, 25 Feb 2015, Robert LeBlanc wrote:
>> We tried to get radosgw working with Apache + mod_fastcgi, but due to
>> the changes in radosgw, Apache, mod_*cgi, etc. and the documentation
>> lagging and not having a lot of time to devote to it, we abandoned it.
>> Where is the documentation for civetweb? If it is appliance-like and
>> easy to set up, we would like to try it to offer some feedback on your
>> question.
>
> In giant and hammer, it is enabled by default on port 7480.  On firefly,
> you need to add the line
>
>  rgw frontends = fastcgi, civetweb port=7480
>
> to ceph.conf (you can of course adjust the port number if you like) and
> radosgw will run standalone w/ no apache or anything else.
>
> sage
>
>
>>
>> Thanks,
>> Robert LeBlanc
>>
>> On Wed, Feb 25, 2015 at 12:31 PM, Sage Weil  wrote:
>> > Hey,
>> >
>> > We are considering switching to civetweb (the embedded/standalone rgw web
>> > server) as the primary supported RGW frontend instead of the current
>> > apache + mod-fastcgi or mod-proxy-fcgi approach.  "Supported" here means
>> > both the primary platform the upstream development focuses on and what the
>> > downstream Red Hat product will officially support.
>> >
>> > How many people are using RGW standalone using the embedded civetweb
>> > server instead of apache?  In production?  At what scale?  What
>> > version(s) (civetweb first appeared in firefly and we've backported most
>> > fixes).
>> >
>> > Have you seen any problems?  Any other feedback?  The hope is to (vastly)
>> > simplify deployment.
>> >
>> > Thanks!
>> > sage
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Clarification of SSD journals for BTRFS rotational HDD

2015-02-25 Thread Robert LeBlanc
I tried finding an answer to this on Google, but couldn't find it.

Since BTRFS can parallel the journal with the write, does it make
sense to have the journal on the SSD (because then we are forcing two
writes instead of one)?

Our plan is to have a caching tier of SSDs in front of our rotational
HDDs and it sounds like the improvements in Hammer will really help
here. If we can take the journals off the SSDs, that just opens up a
bit more space for caching (albeit not much). It specifically makes
the configuration of the host much simpler and a single SSD doesn't
take out 5 HDDs.

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Robert LeBlanc
Step #6 in http://ceph.com/docs/master/install/manual-deployment/#long-form
only sets up the file structure for the OSD; it doesn't start the
long-running daemon process.
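
As a minimal sketch (using osd.16 from this thread; the device name in
the ceph-disk case is hypothetical), the daemon can be started directly
with:

ceph-osd -i 16

or, for a GPT/udev-prepared disk, with:

ceph-disk activate /dev/sdb1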

On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson  wrote:
> But I already issued that command (back in step 6).
>
> The interesting part is that "ceph-disk activate" apparently does it
> correctly. Even after reboot, the services start as they should.
>
> On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc 
> wrote:
>>
>> I think that your problem lies with systemd (even though you are using
>> SysV syntax, systemd is really doing the work). Systemd does not like
>> multiple arguments and I think this is why it is failing. There is
>> supposed to be some work done to get systemd working ok, but I think
>> it has the limitation of only working with a cluster named 'ceph'
>> currently.
>>
>> What I did to get around the problem was to run the osd command manually:
>>
>> ceph-osd -i {osd-num}
>>
>> Once I understood the under-the-hood stuff, I moved to ceph-disk and
>> now, because of the GPT partition IDs, udev automatically starts up the
>> OSD process at boot/creation and moves it to the appropriate CRUSH
>> location (configurable in ceph.conf
>> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location,
>> an example: crush location = host=test rack=rack3 row=row8
>> datacenter=local region=na-west root=default). To restart an OSD
>> process, I just kill the PID for the OSD then issue ceph-disk activate
>> /dev/sdx1 to restart the OSD process. You probably could stop it with
>> systemctl since I believe udev creates a resource for it (I should
>> probably look into that now that this system will be going production
>> soon).
>>
>> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson  wrote:
>> > I'm having a similar issue.
>> >
>> > I'm following http://ceph.com/docs/master/install/manual-deployment/ to
>> > a T.
>> >
>> > I have OSDs on the same host deployed with the short-form and they work
>> > fine. I am trying to deploy some more via the long form (because I want
>> > them
>> > to appear in a different location in the crush map). Everything through
>> > step
>> > 10 (i.e. ceph osd crush add {id-or-name} {weight}
>> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step
>> > 11
>> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get:
>> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines
>> > mon.hobbit01
>> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12
>> > osd.6
>> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7
>> > osd.15
>> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11
>> > osd.5
>> > osd.4 osd.0)
>> >
>> >
>> >
>> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden 
>> > wrote:
>> >>
>> >> Also, did you successfully start your monitor(s), and define/create the
>> >> OSDs within the Ceph cluster itself?
>> >>
>> >> There are several steps to creating a Ceph cluster manually.  I'm
>> >> unsure
>> >> if you have done the steps to actually create and register the OSDs
>> >> with the
>> >> cluster.
>> >>
>> >>  - Travis
>> >>
>> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master 
>> >> wrote:
>> >>>
>> >>> Check firewall rules and selinux. It sometimes is a pain in the ... :)
>> >>>
>> >>> 25 Feb 2015 01:46 "Barclay Jameson" 
>> >>> wrote:
>> >>>
>> >>>> I have tried to install ceph using ceph-deploy but sgdisk seems to
>> >>>> have too many issues so I did a manual install. After mkfs.btrfs on
>> >>>> the disks and journals and mounted them I then tried to start the
>> >>>> osds
>> >>>> which failed. The first error was:
>> >>>> #/etc/init.d/ceph start osd.0
>> >>>> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
>> >>>> /var/lib/ceph defines )
>> >>>>
>> >>>> I then manually added the osds to the conf file with the following as
>> >>>> an example:
>> >>>> [osd.0]
>> >>>> osd_host = node01
>> >>>>
>> >>>> Now when I run the

Re: [ceph-users] who is using radosgw with civetweb?

2015-02-26 Thread Robert LeBlanc
Thanks, we were able to get it up and running very quickly. If it
performs well, I don't see any reason to use Apache+fast_cgi. I don't
have any problems just focusing on civetweb.

On Wed, Feb 25, 2015 at 2:49 PM, Sage Weil  wrote:
> On Wed, 25 Feb 2015, Robert LeBlanc wrote:
>> We tried to get radosgw working with Apache + mod_fastcgi, but due to
>> the changes in radosgw, Apache, mod_*cgi, etc. and the documentation
>> lagging and not having a lot of time to devote to it, we abandoned it.
>> Where is the documentation for civetweb? If it is appliance-like and
>> easy to set up, we would like to try it to offer some feedback on your
>> question.
>
> In giant and hammer, it is enabled by default on port 7480.  On firefly,
> you need to add the line
>
>  rgw frontends = fastcgi, civetweb port=7480
>
> to ceph.conf (you can of course adjust the port number if you like) and
> radosgw will run standalone w/ no apache or anything else.
>
> sage
>
>
>>
>> Thanks,
>> Robert LeBlanc
>>
>> On Wed, Feb 25, 2015 at 12:31 PM, Sage Weil  wrote:
>> > Hey,
>> >
>> > We are considering switching to civetweb (the embedded/standalone rgw web
>> > server) as the primary supported RGW frontend instead of the current
>> > apache + mod-fastcgi or mod-proxy-fcgi approach.  "Supported" here means
>> > both the primary platform the upstream development focuses on and what the
>> > downstream Red Hat product will officially support.
>> >
>> > How many people are using RGW standalone using the embedded civetweb
>> > server instead of apache?  In production?  At what scale?  What
>> > version(s) (civetweb first appeared in firefly and we've backported most
>> > fixes).
>> >
>> > Have you seen any problems?  Any other feedback?  The hope is to (vastly)
>> > simplify deployment.
>> >
>> > Thanks!
>> > sage
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using radosgw with civetweb?

2015-02-26 Thread Robert LeBlanc
[client.radosgw.gateway]
host = radosgw1
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/radosgw/client.radosgw.gateway.log
rgw print continue = false
rgw enable ops log = false
rgw ops log rados = false
rgw ops log data backlog = 4096
rgw frontends = civetweb port=7480

This is firefly on CentOS 6 connecting to a giant cluster.
/etc/init.d/ceph-radosgw start

Just make sure the user defined in /etc/init.d/ceph-radosgw can
read/write to the files listed in the section (for us it was the
apache user).
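
A quick sanity check that the standalone frontend is answering (host
and port taken from the config above) is something like:

curl -i http://radosgw1:7480/

which should come back with an XML ListAllMyBucketsResult for the
anonymous user.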

On Thu, Feb 26, 2015 at 11:39 AM, Deneau, Tom  wrote:
> Robert --
>
> We are still having trouble with this.
>
> Can you share your [client.radosgw.gateway] section of ceph.conf and
> were there any other special things to be aware of?
>
> -- Tom
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Robert LeBlanc
> Sent: Thursday, February 26, 2015 12:27 PM
> To: Sage Weil
> Cc: Ceph-User; ceph-devel
> Subject: Re: [ceph-users] who is using radosgw with civetweb?
>
> Thanks, we were able to get it up and running very quickly. If it performs 
> well, I don't see any reason to use Apache+fast_cgi. I don't have any 
> problems just focusing on civetweb.
>
> On Wed, Feb 25, 2015 at 2:49 PM, Sage Weil  wrote:
>> On Wed, 25 Feb 2015, Robert LeBlanc wrote:
>>> We tried to get radosgw working with Apache + mod_fastcgi, but due to
>>> the changes in radosgw, Apache, mod_*cgi, etc. and the documentation
>>> lagging and not having a lot of time to devote to it, we abandoned it.
>>> Where is the documentation for civetweb? If it is appliance-like and
>>> easy to set up, we would like to try it to offer some feedback on
>>> your question.
>>
>> In giant and hammer, it is enabled by default on port 7480.  On
>> firefly, you need to add the line
>>
>>  rgw frontends = fastcgi, civetweb port=7480
>>
>> to ceph.conf (you can of course adjust the port number if you like)
>> and radosgw will run standalone w/ no apache or anything else.
>>
>> sage
>>
>>
>>>
>>> Thanks,
>>> Robert LeBlanc
>>>
>>> On Wed, Feb 25, 2015 at 12:31 PM, Sage Weil  wrote:
>>> > Hey,
>>> >
>>> > We are considering switching to civetweb (the embedded/standalone
>>> > rgw web
>>> > server) as the primary supported RGW frontend instead of the
>>> > current apache + mod-fastcgi or mod-proxy-fcgi approach.
>>> > "Supported" here means both the primary platform the upstream
>>> > development focuses on and what the downstream Red Hat product will 
>>> > officially support.
>>> >
>>> > How many people are using RGW standalone using the embedded
>>> > civetweb server instead of apache?  In production?  At what scale?
>>> > What
>>> > version(s) (civetweb first appeared in firefly and we've backported
>>> > most fixes).
>>> >
>>> > Have you seen any problems?  Any other feedback?  The hope is to
>>> > (vastly) simplify deployment.
>>> >
>>> > Thanks!
>>> > sage
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majord...@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
> body of a message to majord...@vger.kernel.org More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using radosgw with civetweb?

2015-02-26 Thread Robert LeBlanc
+1 for a proxy. Keep civetweb lean and mean, and if people need
"extras", let the proxy handle them. Proxies are easy to set up, and a
simple example could be included in the documentation.
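
Something along these lines would probably be enough for the docs (a
rough nginx sketch, assuming civetweb is bound to localhost:7480; the
server name is made up):

server {
    listen 80;
    server_name rgw.example.com;
    location / {
        proxy_pass http://127.0.0.1:7480;
        proxy_set_header Host $host;
    }
}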

On Thu, Feb 26, 2015 at 11:43 AM, Wido den Hollander  wrote:
>
>
>> On 26 Feb 2015 at 18:22, Sage Weil  wrote:
>>
>>> On Thu, 26 Feb 2015, Wido den Hollander wrote:
 On 25-02-15 20:31, Sage Weil wrote:
 Hey,

 We are considering switching to civetweb (the embedded/standalone rgw web
 server) as the primary supported RGW frontend instead of the current
 apache + mod-fastcgi or mod-proxy-fcgi approach.  "Supported" here means
 both the primary platform the upstream development focuses on and what the
 downstream Red Hat product will officially support.

 How many people are using RGW standalone using the embedded civetweb
 server instead of apache?  In production?  At what scale?  What
 version(s) (civetweb first appeared in firefly and we've backported most
 fixes).

 Have you seen any problems?  Any other feedback?  The hope is to (vastly)
 simplify deployment.
>>>
>>> It seems like Civetweb listens on 0.0.0.0 by default and that doesn't seem
>>> safe to me.
>>
>> Can you clarify?  Is that because people may inadvertently run this on a
>> public host and not realize that the host is answering requests?
>>
>
> Yes, mainly. I think we should encourage users to run Apache, Nginx or 
> Varnish as a proxy/filter in front.
>
> I'd just suggest to bind on localhost by default and let the user choose 
> otherwise.
>
>> If we move to a world where this is the default/preferred route, this
>> seems like a good thing.. if they don't want to respond on an address they
>> can specify which IP to bind to?
>>
>
> Most services listen on localhost unless specified otherwise.
>
>>> In most deployments you'll put Apache, Nginx or Varnish in front of RGW to 
>>> do
>>> the proper HTTP handling.
>>>
>>> I'd say that Civetweb should listen on 127.0.0.1:7480/[::1]:7480 by default.
>>>
>>> And make sure it listens on IPv6 by default :-)
>>
>> Yeah, +1 on IPv6:)
>>
>> sage
>>
>>
>>>
>>> Wido
>>>
 Thanks!
 sage
 --
 To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] old osds take much longer to start than newer osd

2015-02-27 Thread Robert LeBlanc
Does deleting/reformatting the old osds improve the performance?

On Fri, Feb 27, 2015 at 6:02 AM, Corin Langosch
 wrote:
> Hi guys,
>
> I'm using ceph for a long time now, since bobtail. I always upgraded every 
> few weeks/ months to the latest stable
> release. Of course I also removed some osds and added new ones. Now during 
> the last few upgrades (I just upgraded from
> 80.6 to 80.8) I noticed that old osds take much longer to startup than equal 
> newer osds (same amount of data/ disk
> usage, same kind of storage+journal backing device (ssd), same weight, same 
> number of pgs, ...). I know I observed the
> same behavior earlier but just didn't really care about it. Here are the 
> relevant log entries (host of osd.0 and osd.15
> has less cpu power than the others):
>
> old osds (average pgs load time: 1.5 minutes)
>
> 2015-02-27 13:44:23.134086 7ffbfdcbe780  0 osd.0 19323 load_pgs
> 2015-02-27 13:49:21.453186 7ffbfdcbe780  0 osd.0 19323 load_pgs opened 824 pgs
>
> 2015-02-27 13:41:32.219503 7f197b0dd780  0 osd.3 19317 load_pgs
> 2015-02-27 13:42:56.310874 7f197b0dd780  0 osd.3 19317 load_pgs opened 776 pgs
>
> 2015-02-27 13:38:43.909464 7f450ac90780  0 osd.6 19309 load_pgs
> 2015-02-27 13:40:40.080390 7f450ac90780  0 osd.6 19309 load_pgs opened 806 pgs
>
> 2015-02-27 13:36:14.451275 7f3c41d33780  0 osd.9 19301 load_pgs
> 2015-02-27 13:37:22.446285 7f3c41d33780  0 osd.9 19301 load_pgs opened 795 pgs
>
> new osds (average pgs load time: 3 seconds)
>
> 2015-02-27 13:44:25.529743 7f2004617780  0 osd.15 19325 load_pgs
> 2015-02-27 13:44:36.197221 7f2004617780  0 osd.15 19325 load_pgs opened 873 
> pgs
>
> 2015-02-27 13:41:29.176647 7fb147fb3780  0 osd.16 19315 load_pgs
> 2015-02-27 13:41:31.681722 7fb147fb3780  0 osd.16 19315 load_pgs opened 848 
> pgs
>
> 2015-02-27 13:38:41.470761 7f9c404be780  0 osd.17 19307 load_pgs
> 2015-02-27 13:38:43.737473 7f9c404be780  0 osd.17 19307 load_pgs opened 821 
> pgs
>
> 2015-02-27 13:36:10.997766 7f7315e99780  0 osd.18 19299 load_pgs
> 2015-02-27 13:36:13.511898 7f7315e99780  0 osd.18 19299 load_pgs opened 815 
> pgs
>
> The old osds also take more memory, here's an example:
>
> root 15700 22.8  0.7 1423816 485552 ?  Ssl  13:36   4:55 
> /usr/bin/ceph-osd -i 9 --pid-file
> /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.conf --cluster ceph
> root 15270 15.4  0.4 1227140 297032 ?  Ssl  13:36   3:20 
> /usr/bin/ceph-osd -i 18 --pid-file
> /var/run/ceph/osd.18.pid -c /etc/ceph/ceph.conf --cluster ceph
>
>
> It seems to me there is still some old data around for the old osds which was 
> not properly migrated/ cleaned up during
> the upgrades. The cluster is healthy, no problems at all the last few weeks. 
> Is there any way to clean this up?
>
> Thanks
> Corin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clarification of SSD journals for BTRFS rotational HDD

2015-02-27 Thread Robert LeBlanc
Also sending to the devel list to see if they have some insight.

On Wed, Feb 25, 2015 at 3:01 PM, Robert LeBlanc  wrote:
> I tried finding an answer to this on Google, but couldn't find it.
>
> Since BTRFS can parallel the journal with the write, does it make
> sense to have the journal on the SSD (because then we are forcing two
> writes instead of one)?
>
> Our plan is to have a caching tier of SSDs in front of our rotational
> HDDs and it sounds like the improvements in Hammer will really help
> here. If we can take the journals off the SSDs, that just opens up a
> bit more space for caching (albeit not much). It specifically makes
> the configuration of the host much simpler and a single SSD doesn't
> take out 5 HDDs.
>
> Thanks,
> Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?

2015-03-03 Thread Robert LeBlanc
I would be inclined to shut down both OSDs in a node, let the cluster
recover. Once it is recovered, shut down the next two, let it recover.
Repeat until all the OSDs are taken out of the cluster. Then I would
set nobackfill and norecover. Then remove the hosts/disks from the
CRUSH then unset nobackfill and norecover.

That should give you a few small changes (when you shut down OSDs) and
then one big one to get everything in the final place. If you are
still adding new nodes, when nobackfill and norecover is set, you can
add them in so that the one big relocate fills the new drives too.
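
A rough sketch of that final CRUSH-removal step (the OSD ids and host
name here are placeholders):

ceph osd set nobackfill
ceph osd set norecover
ceph osd crush remove osd.40
ceph osd crush remove osd.41
ceph osd crush remove oldnode1
ceph osd unset nobackfill
ceph osd unset norecover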

On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic  wrote:
> Thx Irek. Number of replicas is 3.
>
> I have 3 servers with 2 OSDs on them on 1g switch (1 OSD already
> decommissioned), which is further connected to a new 10G switch/network with
> 3 servers on it with 12 OSDs each.
> I'm decommissioning old 3 nodes on 1G network...
>
> So you suggest removing whole node with 2 OSDs manually from crush map?
> Per my knowledge, ceph never places 2 replicas on 1 node, all 3 replicas
> were originally been distributed over all 3 nodes. So anyway It could be
> safe to remove 2 OSDs at once together with the node itself...since replica
> count is 3...
> ?
>
> Thx again for your time
>
> On Mar 3, 2015 1:35 PM, "Irek Fasikhov"  wrote:
>>
>> Once you have only three nodes in the cluster.
>> I recommend you add new nodes to the cluster, and then delete the old.
>>
>> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov :
>>>
>>> You have a number of replication?
>>>
>>> 2015-03-03 15:14 GMT+03:00 Andrija Panic :

 Hi Irek,

 yes, stoping OSD (or seting it to OUT) resulted in only 3% of data
 degraded and moved/recovered.
 When I after that removed it from Crush map "ceph osd crush rm id",
 that's when the stuff with 37% happened.

 And thanks Irek for help - could you kindly just let me know of the
 prefered steps when removing whole node?
 Do you mean I first stop all OSDs again, or just remove each OSD from
 crush map, or perhaps, just decompile cursh map, delete the node 
 completely,
 compile back in, and let it heal/recover ?

 Do you think this would result in less data missplaces and moved arround
 ?

 Sorry for bugging you, I really appreaciate your help.

 Thanks

 On 3 March 2015 at 12:58, Irek Fasikhov  wrote:
>
> A large percentage of the rebuild of the cluster map (But low
> percentage degradation). If you had not made "ceph osd crush rm id", the
> percentage would be low.
> In your case, the correct option is to remove the entire node, rather
> than each disk individually
>
> 2015-03-03 14:27 GMT+03:00 Andrija Panic :
>>
>> Another question - I mentioned here 37% of objects being moved arround
>> - this is MISPLACED object (degraded objects were 0.001%, after I 
>> removed 1
>> OSD from cursh map (out of 44 OSD or so).
>>
>> Can anybody confirm this is normal behaviour - and are there any
>> workarrounds ?
>>
>> I understand this is because of the object placement algorithm of
>> CEPH, but still 37% of object missplaces just by removing 1 OSD from 
>> crush
>> maps out of 44 make me wonder why this large percentage ?
>>
>> Seems not good to me, and I have to remove another 7 OSDs (we are
>> demoting some old hardware nodes). This means I can potentialy go with 7 
>> x
>> the same number of missplaced objects...?
>>
>> Any thoughts ?
>>
>> Thanks
>>
>> On 3 March 2015 at 12:14, Andrija Panic 
>> wrote:
>>>
>>> Thanks Irek.
>>>
>>> Does this mean, that after peering for each PG, there will be delay
>>> of 10sec, meaning that every once in a while, I will have 10sec od the
>>> cluster NOT being stressed/overloaded, and then the recovery takes 
>>> place for
>>> that PG, and then another 10sec cluster is fine, and then stressed 
>>> again ?
>>>
>>> I'm trying to understand process before actually doing stuff (config
>>> reference is there on ceph.com but I don't fully understand the process)
>>>
>>> Thanks,
>>> Andrija
>>>
>>> On 3 March 2015 at 11:32, Irek Fasikhov  wrote:

 Hi.

 Use value "osd_recovery_delay_start"
 example:
 [root@ceph08 ceph]# ceph --admin-daemon
 /var/run/ceph/ceph-osd.94.asok config show  | grep 
 osd_recovery_delay_start
   "osd_recovery_delay_start": "10"

 2015-03-03 13:13 GMT+03:00 Andrija Panic :
>
> HI Guys,
>
> I yesterday removed 1 OSD from cluster (out of 42 OSDs), and it
> caused over 37% od the data to rebalance - let's say this is fine 
> (this is
> when I removed it frm Crush Map).
>
> I'm wondering - I have previously set some throtling mechanism, but
>>>

Re: [ceph-users] Implement replication network with live cluster

2015-03-04 Thread Robert LeBlanc
If I remember right, someone has done this on a live cluster without
any issues. I seem to remember that it had a fallback mechanism if the
OSDs couldn't be reached on the cluster network to contact them on the
public network. You could test it pretty easily without much impact.
Take one OSD that has both networks and configure it and restart the
process. If all the nodes (specifically the old ones with only one
network) are able to connect to it, then you are good to go by
restarting one OSD at a time.
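
For reference, a minimal sketch of the ceph.conf change on each node
(the subnets here are made up):

[global]
public network = 192.168.1.0/24
cluster network = 10.10.1.0/24

and then restart the OSDs one at a time (e.g. service ceph restart
osd.0), checking that peering stays healthy before moving on.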

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic  wrote:
> Hi,
>
> I'm having a live cluster with only public network (so no explicit network
> configuraion in the ceph.conf file)
>
> I'm wondering what is the procedure to implement dedicated
> Replication/Private and Public network.
> I've read the manual, know how to do it in ceph.conf, but I'm wondering
> since this is already running cluster - what should I do after I change
> ceph.conf on all nodes ?
> Restarting OSDs one by one, or... ? Is there any downtime expected ? - for
> the replication network to actually imlemented completely.
>
>
> Another related quetion:
>
> Also, I'm demoting some old OSDs, on old servers, I will have them all
> stoped, but would like to implement replication network before actually
> removing old OSDs from crush map - since lot of data will be moved arround.
>
> My old nodes/OSDs (that will be stoped before I implement replication
> network) - do NOT have dedicated NIC for replication network, in contrast to
> new nodes/OSDs. So there will be still reference to these old OSD in the
> crush map.
> Will this be a problem - me changing/implementing replication network that
> WILL work on new nodes/OSDs, but not on old ones since they don't have
> dedicated NIC ? I guess not since old OSDs are stoped anyway, but would like
> opinion.
>
> Or perhaps i might remove OSD from crush map with prior seting of
> nobackfill and   norecover (so no rebalancing happens) and then implement
> replication netwotk?
>
>
> Sorry for old post, but...
>
> Thanks,
> --
>
> Andrija Panić
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?

2015-03-04 Thread Robert LeBlanc
You will most likely have a very high relocation percentage. Backfills
always are more impactful on smaller clusters, but "osd max backfills"
should be what you need to help reduce the impact. The default is 10,
you will want to use 1.

I didn't catch which version of Ceph you are running, but I think
there was some priority work done in firefly to help make backfills
lower priority. I think it has gotten better in later versions.
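
For what it's worth, a sketch of that throttling ("osd max backfills" as
discussed above; "osd recovery max active" is another knob people
commonly lower, and the values are just examples):

[osd]
osd max backfills = 1
osd recovery max active = 1

or, on a running cluster:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'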

On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic  wrote:
> Thank you Rober - I'm wondering when I do remove total of 7 OSDs from crush
> map - weather that will cause more than 37% of data moved (80% or whatever)
>
> I'm also wondering if the thortling that I applied is fine or not - I will
> introduce the osd_recovery_delay_start 10sec as Irek said.
>
> I'm just wondering hom much will be the performance impact, because:
> - when stoping OSD, the impact while backfilling was fine more or a less - I
> can leave with this
> - when I removed OSD from cursh map - first 1h or so, impact was tremendous,
> and later on during recovery process impact was much less but still
> noticable...
>
> Thanks for the tip of course !
> Andrija
>
> On 3 March 2015 at 18:34, Robert LeBlanc  wrote:
>>
>> I would be inclined to shut down both OSDs in a node, let the cluster
>> recover. Once it is recovered, shut down the next two, let it recover.
>> Repeat until all the OSDs are taken out of the cluster. Then I would
>> set nobackfill and norecover. Then remove the hosts/disks from the
>> CRUSH then unset nobackfill and norecover.
>>
>> That should give you a few small changes (when you shut down OSDs) and
>> then one big one to get everything in the final place. If you are
>> still adding new nodes, when nobackfill and norecover is set, you can
>> add them in so that the one big relocate fills the new drives too.
>>
>> On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic 
>> wrote:
>> > Thx Irek. Number of replicas is 3.
>> >
>> > I have 3 servers with 2 OSDs on them on 1g switch (1 OSD already
>> > decommissioned), which is further connected to a new 10G switch/network
>> > with
>> > 3 servers on it with 12 OSDs each.
>> > I'm decommissioning old 3 nodes on 1G network...
>> >
>> > So you suggest removing whole node with 2 OSDs manually from crush map?
>> > Per my knowledge, ceph never places 2 replicas on 1 node, all 3 replicas
>> > were originally been distributed over all 3 nodes. So anyway It could be
>> > safe to remove 2 OSDs at once together with the node itself...since
>> > replica
>> > count is 3...
>> > ?
>> >
>> > Thx again for your time
>> >
>> > On Mar 3, 2015 1:35 PM, "Irek Fasikhov"  wrote:
>> >>
>> >> Once you have only three nodes in the cluster.
>> >> I recommend you add new nodes to the cluster, and then delete the old.
>> >>
>> >> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov :
>> >>>
>> >>> You have a number of replication?
>> >>>
>> >>> 2015-03-03 15:14 GMT+03:00 Andrija Panic :
>> >>>>
>> >>>> Hi Irek,
>> >>>>
>> >>>> yes, stoping OSD (or seting it to OUT) resulted in only 3% of data
>> >>>> degraded and moved/recovered.
>> >>>> When I after that removed it from Crush map "ceph osd crush rm id",
>> >>>> that's when the stuff with 37% happened.
>> >>>>
>> >>>> And thanks Irek for help - could you kindly just let me know of the
>> >>>> prefered steps when removing whole node?
>> >>>> Do you mean I first stop all OSDs again, or just remove each OSD from
>> >>>> crush map, or perhaps, just decompile cursh map, delete the node
>> >>>> completely,
>> >>>> compile back in, and let it heal/recover ?
>> >>>>
>> >>>> Do you think this would result in less data missplaces and moved
>> >>>> arround
>> >>>> ?
>> >>>>
>> >>>> Sorry for bugging you, I really appreaciate your help.
>> >>>>
>> >>>> Thanks
>> >>>>
>> >>>> On 3 March 2015 at 12:58, Irek Fasikhov  wrote:
>> >>>>>
>> >>>>> A large percentage of the rebuild of the cluster map (But low
>> >>>>> percentage degradation). If you had not made "ceph osd crush rm id",
>> &g

Re: [ceph-users] Implement replication network with live cluster

2015-03-04 Thread Robert LeBlanc
If the data has been replicated to the new OSDs, the cluster will be able
to function properly even with the old ones down or reachable only on the
public network.

On Wed, Mar 4, 2015 at 9:49 AM, Andrija Panic  wrote:
> "I guess it doesnt matter, since my Crush Map will still refernce old OSDs,
> that are stoped (and cluster resynced after that) ?"
>
> I wanted to say: it doesnt matter (I guess?) that my Crush map is still
> referencing old OSD nodes that are already stoped. Tired, sorry...
>
> On 4 March 2015 at 17:48, Andrija Panic  wrote:
>>
>> That was my thought, yes - I found this blog that confirms what you are
>> saying I guess:
>> http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/
>> I will do that... Thx
>>
>> I guess it doesnt matter, since my Crush Map will still refernce old OSDs,
>> that are stoped (and cluster resynced after that) ?
>>
>> Thx again for the help
>>
>> On 4 March 2015 at 17:44, Robert LeBlanc  wrote:
>>>
>>> If I remember right, someone has done this on a live cluster without
>>> any issues. I seem to remember that it had a fallback mechanism if the
>>> OSDs couldn't be reached on the cluster network to contact them on the
>>> public network. You could test it pretty easily without much impact.
>>> Take one OSD that has both networks and configure it and restart the
>>> process. If all the nodes (specifically the old ones with only one
>>> network) are able to connect to it, then you are good to go by
>>> restarting one OSD at a time.
>>>
>>> On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic 
>>> wrote:
>>> > Hi,
>>> >
>>> > I'm having a live cluster with only public network (so no explicit
>>> > network
>>> > configuraion in the ceph.conf file)
>>> >
>>> > I'm wondering what is the procedure to implement dedicated
>>> > Replication/Private and Public network.
>>> > I've read the manual, know how to do it in ceph.conf, but I'm wondering
>>> > since this is already running cluster - what should I do after I change
>>> > ceph.conf on all nodes ?
>>> > Restarting OSDs one by one, or... ? Is there any downtime expected ? -
>>> > for
>>> > the replication network to actually imlemented completely.
>>> >
>>> >
>>> > Another related quetion:
>>> >
>>> > Also, I'm demoting some old OSDs, on old servers, I will have them all
>>> > stoped, but would like to implement replication network before actually
>>> > removing old OSDs from crush map - since lot of data will be moved
>>> > arround.
>>> >
>>> > My old nodes/OSDs (that will be stoped before I implement replication
>>> > network) - do NOT have dedicated NIC for replication network, in
>>> > contrast to
>>> > new nodes/OSDs. So there will be still reference to these old OSD in
>>> > the
>>> > crush map.
>>> > Will this be a problem - me changing/implementing replication network
>>> > that
>>> > WILL work on new nodes/OSDs, but not on old ones since they don't have
>>> > dedicated NIC ? I guess not since old OSDs are stoped anyway, but would
>>> > like
>>> > opinion.
>>> >
>>> > Or perhaps i might remove OSD from crush map with prior seting of
>>> > nobackfill and   norecover (so no rebalancing happens) and then
>>> > implement
>>> > replication netwotk?
>>> >
>>> >
>>> > Sorry for old post, but...
>>> >
>>> > Thanks,
>>> > --
>>> >
>>> > Andrija Panić
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>
>
>
>
> --
>
> Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Teething Problems

2015-03-04 Thread Robert LeBlanc
I can't help much on the MDS front, but here are some answers and my
view on some of it.

On Wed, Mar 4, 2015 at 1:27 PM, Datatone Lists  wrote:
> I have been following ceph for a long time. I have yet to put it into
> service, and I keep coming back as btrfs improves and ceph reaches
> higher version numbers.
>
> I am now trying ceph 0.93 and kernel 4.0-rc1.
>
> Q1) Is it still considered that btrfs is not robust enough, and that
> xfs should be used instead? [I am trying with btrfs].

We are moving forward with btrfs on our production cluster aware that
there may be performance issues. So far, it seems the later kernels
have resolved the issues we've seen with snapshots. As the system
grows we will keep an eye on it and are prepared to move to XFS if
needed.

> I followed the manual deployment instructions on the web site
> (http://ceph.com/docs/master/install/manual-deployment/) and I managed
> to get a monitor and several osds running and apparently working. The
> instructions fizzle out without explaining how to set up mds. I went
> back to mkcephfs and got things set up that way. The mds starts.
>
> [Please don't mention ceph-deploy]
>
> The first thing that I noticed is that (whether I set up mon and osds
> by following the manual deployment, or using mkcephfs), the correct
> default pools were not created.

> bash-4.3# ceph osd lspools
> 0 rbd,
> bash-4.3#
>
>  I get only 'rbd' created automatically. I deleted this pool, and
>  re-created data, metadata and rbd manually. When doing this, I had to
>  juggle with the pg-num in order to avoid the 'too many pgs for osd'.
>  I have three osds running at the moment, but intend to add to these
>  when I have some experience of things working reliably. I am puzzled,
>  because I seem to have to set the pg-num for the pool to a number that
>  makes (N-pools x pg-num)/N-osds come to the right kind of number. So
>  this implies that I can't really expand a set of pools by adding osds
>  at a later date.
>
> Q2) Is there any obvious reason why my default pools are not getting
> created automatically as expected?

Since Giant, these pools are not automatically created, only the rbd pool is.
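
If you do want CephFS, the pools and the filesystem now have to be
created explicitly, along the lines of (the pg counts are just an
example):

ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 64
ceph fs new cephfs cephfs_metadata cephfs_data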

> Q3) Can pg-num be modified for a pool later? (If the number of osds is
> increased dramatically).

pg_num and pgp_num can be increased (not decreased) on the fly later
to expand with more OSDs.
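
For example (pool name and target counts are placeholders):

ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256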

> Finally, when I try to mount cephfs, I get a mount 5 error.
>
> "A mount 5 error typically occurs if a MDS server is laggy or if it
> crashed. Ensure at least one MDS is up and running, and the cluster is
> active + healthy".
>
> My mds is running, but its log is not terribly active:
>
> 2015-03-04 17:47:43.177349 7f42da2c47c0  0 ceph version 0.93
> (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110
> 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors
> {default=true}
>
> (This is all there is in the log).
>
> I think that a key indicator of the problem must be this from the
> monitor log:
>
> 2015-03-04 16:53:20.715132 7f3cd0014700  1
> mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.?
> [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem
> disabled
>
> (I have added the '' sections to obscure my ip address)
>
> Q4) Can you give me an idea of what is wrong that causes the mds to not
> play properly?
>
> I think that there are some typos on the manual deployment pages, for
> example:
>
> ceph-osd id={osd-num}
>
> This is not right. As far as I am aware it should be:
>
> ceph-osd -i {osd-num}

There are a few of these, usually running --help for the command gives
you the right syntax needed for the version you have installed. But it
is still very confusing.

> An observation. In principle, setting things up manually is not all
> that complicated, provided that clear and unambiguous instructions are
> provided. This simple piece of documentation is very important. My view
> is that the existing manual deployment instructions gets a bit confused
> and confusing when it gets to the osd setup, and the mds setup is
> completely absent.
>
> For someone who knows, this would be a fairly simple and fairly quick
> operation to review and revise this part of the documentation. I
> suspect that this part suffers from being really obvious stuff to the
> well initiated. For those of us closer to the start, this forms the
> ends of the threads that have to be picked up before the journey can be
> made.
>
> Very best regards,
> David
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Teething Problems

2015-03-05 Thread Robert LeBlanc
David,

You will need to raise the open-file limit on the Linux system. Check
/etc/security/limits.conf. It is explained somewhere in the docs, and the
autostart scripts 'fix' the issue for most people. When I did a manual
deploy for the same reasons you are, I ran into this too.
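
As a rough example, something like this in /etc/security/limits.conf
(the number is just a common starting point):

* soft nofile 131072
* hard nofile 131072

or set "max open files" in the [global] section of ceph.conf so the
daemons raise it themselves.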

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 5, 2015 3:14 AM, "Datatone Lists"  wrote:

>
> Thank you all for such wonderful feedback.
>
> Thank you to John Spray for putting me on the right track. I now see
> that the cephfs aspect of the project is being de-emphasised, so that
> the manual deployment instructions tell how to set up the object store,
> and then the cephfs is a separate issue that needs to be explicitly set
> up and configured in its own right. So that explains why the cephfs
> pools are not created by default, and why the required cephfs pools are
> now referred to, not as 'data' and 'metadata', but 'cepfs_data' and
> 'cephfs_metadata'. I have created these pools, and created a new cephfs
> filesystem, and I can mount it without problem.
>
> This confirms my suspicion that the manual deployment pages are in need
> of review and revision. They still refer to three default pools. I am
> happy that this section should deal with the object store setup only,
> but I still think that the osd part is a bit confused and confusing,
> particularly with respect to what is done on which machine. It would
> then be useful to say something like "this completes the configuration
> of the basic store. If you wish to use cephfs, you must set up a
> metadata server, appropriate pools, and a cephfs filesystem. (See
> http://...)".
>
> I was not trying to be smart or obscure when I made a brief and
> apparently dismissive reference to ceph-deploy. I railed against it and
> the demise of mkcephfs on this list at the point that mkcephfs was
> discontinued in the releases. That caused a few supportive responses at
> the time, so I know that I'm not alone. I did not wish to trawl over
> those arguments again unnecessarily.
>
> There is a principle that is being missed. The 'ceph' code contains
> everything required to set up and operate a ceph cluster. There should
> be documentation detailing how this is done.
>
> 'Ceph-deploy' is a separate thing. It is one of several tools that
> promise to make setting things up easy. However, my resistance is based
> on two factors. If I recall correctly, it is one of those projects in
> which the configuration needs to know what 'distribution' is being
> used. (Presumably, this is to try to deduce where various things are
> located). So if one is not using one of these 'distributions', one is
> stuffed right from the start. Secondly, the challenge that we are
> trying to overcome is learning what the various ceph components need,
> and how they need to be set up and configured. I don't think that the
> "don't worry your pretty little head about that, we have a natty tool
> to do it for you" approach is particularly useful.
>
> So I am not knocking ceph-deploy, Travis, it is just that I do not
> believe that it is relevant or useful to me at this point in time.
>
> I see that Lionel Bouton seems to share my views here.
>
> In general, the ceph documentation (in my humble opinion) needs to be
> draughted with a keen eye on the required scope. Deal with ceph; don't
> let it get contaminated with 'ceph-deploy', 'upstart', 'systemd', or
> anything else that is not actually part of ceph.
>
> As an example, once you have configured your osd, you start it with:
>
> ceph-osd -i {osd-number}
>
> It is as simple as that!
>
> If it is required to start the osd automatically, then that will be
> done using sysvinit, upstart, systemd, or whatever else is being used
> to bring the system up in the first place. It is unnecessary and
> confusing to try to second-guess the environment in which ceph may be
> being used, and contaminate the documentation with such details.
> (Having said that, I see no problem with adding separate, helpful,
> sections such as "Suggestions for starting using 'upstart'", or
> "Suggestions for starting using 'systemd'").
>
> So I would reiterate the point that the really important documentation
> is probably quite simple for an expert to produce. Just spell out what
> each component needs in terms of keys, access to keys, files, and so
> on. Spell out how to set everything up. Also how to change things after
> the event, so that 'trial and error' does not have to contain really
> expensive errors. Once we un

Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?

2015-03-05 Thread Robert LeBlanc
Setting an OSD out will start the rebalance without degrading objects; they
are only counted as misplaced. The OSD is still alive and can participate in
the relocation of the objects. This is preferable so that you don't drop
below min_size if a disk fails during the rebalance, in which case I/O stops
on the cluster.

Because CRUSH is an algorithm, anything that changes its inputs will change
the output (the locations). When you set an OSD out or it fails, the CRUSH
map changes, but the host and the weight of the host are still in effect.
When you remove the host or change the weight of the host (by removing a
single OSD), that changes the inputs again and will also change how the
locations are computed.

Disclaimer - I have not tried this

It may be possible to minimize the data movement by doing the following:

   1. set norecover and nobackfill on the cluster
   2. Set the OSDs to be removed to "out"
   3. Adjust the weight of the hosts in the CRUSH (if removing all OSDs for
   the host, set it to zero)
   4. If you have new OSDs to add, add them into the cluster now
   5. Once all OSDs changes have been entered, unset norecover and
   nobackfill
   6. This will migrate the data off the old OSDs and onto the new OSDs in
   one swoop.
   7. Once the data migration is complete, set norecover and nobackfill on
   the cluster again.
   8. Remove the old OSDs
   9. Unset norecover and nobackfill

The theory is that by setting the host weights to 0, removing the
OSDs/hosts later should minimize the data movement afterwards because the
algorithm should have already dropped them as candidates for placement.
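Roughly, the commands would look like this (OSD ids 10-12 are placeholders
for the ones being removed; again, I have not tried this end to end):

ceph osd set norecover
ceph osd set nobackfill
for i in 10 11 12; do
    ceph osd out $i
    # zeroing each OSD's CRUSH weight effectively zeroes the host weight too
    ceph osd crush reweight osd.$i 0
done
# add any new OSDs here, then let the data move in one pass
ceph osd unset norecover
ceph osd unset nobackfill
# ... wait for the migration to finish, then remove the old OSDs
ceph osd set norecover
ceph osd set nobackfill
for i in 10 11 12; do
    ceph osd crush remove osd.$i
    ceph auth del osd.$i
    ceph osd rm $i
done
ceph osd unset norecover
ceph osd unset nobackfill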

If this works right, then you basically queue up a bunch of small changes,
do one data movement, always keep all copies of your objects online and
minimize the impact of the data movement by leveraging both your old and
new hardware at the same time.

If you try this, please report back on your experience. I'm might try it in
my lab, but I'm really busy at the moment so I don't know if I'll get to it
real soon.

On Thu, Mar 5, 2015 at 12:53 PM, Andrija Panic 
wrote:

> Hi Robert,
>
> it seems I have not listened well on your advice - I set osd to out,
> instead of stoping it - and now instead of some ~ 3% of degraded objects,
> now there is 0.000% of degraded, and arround 6% misplaced - and rebalancing
> is happening again, but this is small percentage..
>
> Do you know if later when I remove this OSD from crush map - no more data
> will be rebalanced (as per CEPH official documentation) - since already
> missplaced objects are geting distributed away to all other nodes ?
>
> (after service ceph stop osd.0 - there was 2.45% degraded data - but no
> backfilling was happening for some reason...it just stayed degraded... so
> this is a reason why I started back the OSD, and then set it to out...)
>
> Thanks
>
> On 4 March 2015 at 17:54, Andrija Panic  wrote:
>
>> Hi Robert,
>>
>> I already have this stuff set. CEph is 0.87.0 now...
>>
>> Thanks, will schedule this for weekend, 10G network and 36 OSDs - should
>> move data in less than 8h per my last experineced that was arround8h, but
>> some 1G OSDs were included...
>>
>> Thx!
>>
>> On 4 March 2015 at 17:49, Robert LeBlanc  wrote:
>>
>>> You will most likely have a very high relocation percentage. Backfills
>>> always are more impactful on smaller clusters, but "osd max backfills"
>>> should be what you need to help reduce the impact. The default is 10,
>>> you will want to use 1.
>>>
>>> I didn't catch which version of Ceph you are running, but I think
>>> there was some priority work done in firefly to help make backfills
>>> lower priority. I think it has gotten better in later versions.
>>>
>>> On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic 
>>> wrote:
>>> > Thank you Rober - I'm wondering when I do remove total of 7 OSDs from
>>> crush
>>> > map - weather that will cause more than 37% of data moved (80% or
>>> whatever)
>>> >
>>> > I'm also wondering if the thortling that I applied is fine or not - I
>>> will
>>> > introduce the osd_recovery_delay_start 10sec as Irek said.
>>> >
>>> > I'm just wondering hom much will be the performance impact, because:
>>> > - when stoping OSD, the impact while backfilling was fine more or a
>>> less - I
>>> > can leave with this
>>> > - when I removed OSD from cursh map - first 1h or so, impact was
>>> tremendous,
>>> > and later on during recovery process impact was much less but still
>>> > noticable...
>>> >
>>> > Thanks for the t

Re: [ceph-users] Prioritize Heartbeat packets

2015-03-06 Thread Robert LeBlanc
I see that Jian Wen has done work on this for 0.94. I tried looking through
the code to see if I can figure out how to configure this new option, but
it all went over my head pretty quick.

Can I get a brief summary on how to set the priority of heartbeat packets
or where to look in the code to figure it out?

Thanks,

On Thu, Aug 28, 2014 at 2:01 AM, Daniel Swarbrick <
daniel.swarbr...@profitbricks.com> wrote:

> On 28/08/14 02:56, Sage Weil wrote:
> > I seem to remember someone telling me there were hooks/hints you could
> > call that would tag either a socket or possibly data on that socket with
> a
> > label for use by iptables and such.. but I forget what it was.
> >
>
> Something like setsockopt() SO_MARK?
>
>*SO_MARK *(since Linux 2.6.25)
>   Set the mark for each packet sent through this socket
> (similar
>   to the netfilter MARK target but socket-based).  Changing the
>   mark can be used for mark-based routing without netfilter or
>   for packet filtering.  Setting this option requires the
>   *CAP_NET_ADMIN *capability.
>
> Alternatively, directly set IP_TOS options on the socket, or SO_PRIORITY
> which sets the IP TOS bits as well.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Prioritize Heartbeat packets

2015-03-06 Thread Robert LeBlanc
Hidden HTML ... trying again...

-- Forwarded message --
From: Robert LeBlanc 
Date: Fri, Mar 6, 2015 at 5:20 PM
Subject: Re: [ceph-users] Prioritize Heartbeat packets
To: "ceph-users@lists.ceph.com" ,
ceph-devel 


I see that Jian Wen has done work on this for 0.94. I tried looking
through the code to see if I can figure out how to configure this new
option, but it all went over my head pretty quick.

Can I get a brief summary on how to set the priority of heartbeat
packets or where to look in the code to figure it out?

Thanks,

On Thu, Aug 28, 2014 at 2:01 AM, Daniel Swarbrick
 wrote:
>
> On 28/08/14 02:56, Sage Weil wrote:
> > I seem to remember someone telling me there were hooks/hints you could
> > call that would tag either a socket or possibly data on that socket with a
> > label for use by iptables and such.. but I forget what it was.
> >
>
> Something like setsockopt() SO_MARK?
>
>*SO_MARK *(since Linux 2.6.25)
>   Set the mark for each packet sent through this socket (similar
>   to the netfilter MARK target but socket-based).  Changing the
>   mark can be used for mark-based routing without netfilter or
>   for packet filtering.  Setting this option requires the
>   *CAP_NET_ADMIN *capability.
>
> Alternatively, directly set IP_TOS options on the socket, or SO_PRIORITY
> which sets the IP TOS bits as well.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize Heartbeat packets

2015-03-09 Thread Robert LeBlanc
I've found commit 9b9a682fe035c985e416ee1c112fa58f9045a27c and I see
that when 'osd heartbeat use min delay socket = true' it will mark the
packet with DSCP CS6. Based on the setting of the socket in
msg/simple/Pipe.cc, is it possible that this can apply to both the OSD and
the monitor? I don't understand the code well enough to know how
set_socket_options() is called from the OSD and monitor.
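For what it's worth, the way I plan to test it is just the config option
plus a capture filter to confirm the marking (the interface name here is
only an example):

[osd]
    osd heartbeat use min delay socket = true

# CS6 is DSCP 48, i.e. a TOS byte of 0xc0
tcpdump -ni eth0 '(ip[1] & 0xfc) == 0xc0'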

If this applies to both monitor and OSD, would it be better to rename
the option to a more generic name?

Thanks,

On Sat, Mar 7, 2015 at 4:23 PM, Daniel Swarbrick
 wrote:
> Judging by the commit, this ought to do the trick:
>
> osd heartbeat use min delay socket = true
>
> On 07/03/15 01:20, Robert LeBlanc wrote:
>>
>> I see that Jian Wen has done work on this for 0.94. I tried looking
>> through the code to see if I can figure out how to configure this new
>> option, but it all went over my head pretty quick.
>>
>> Can I get a brief summary on how to set the priority of heartbeat
>> packets or where to look in the code to figure it out?
>>
>> Thanks,
>>
>> On Thu, Aug 28, 2014 at 2:01 AM, Daniel Swarbrick
>> > <mailto:daniel.swarbr...@profitbricks.com>> wrote:
>>
>> On 28/08/14 02:56, Sage Weil wrote:
>> > I seem to remember someone telling me there were hooks/hints you
>> could
>> > call that would tag either a socket or possibly data on that socket
>> with a
>> > label for use by iptables and such.. but I forget what it was.
>> >
>>
>> Something like setsockopt() SO_MARK?
>>
>> *SO_MARK *(since Linux 2.6.25)
>>Set the mark for each packet sent through this socket
>> (similar
>>to the netfilter MARK target but socket-based).
>> Changing the
>>mark can be used for mark-based routing without
>> netfilter or
>>for packet filtering.  Setting this option requires the
>>*CAP_NET_ADMIN *capability.
>>
>> Alternatively, directly set IP_TOS options on the socket, or
>> SO_PRIORITY
>> which sets the IP TOS bits as well.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize Heartbeat packets

2015-03-09 Thread Robert LeBlanc
Jian,

Thanks for the clarification. I'll mark traffic destined for the
monitors as well. We are getting ready to put our first cluster into
production. If you are interested, we will be testing the heartbeat
priority to see if we can saturate the network (not an easy task for
40 Gb) and keep the cluster from falling apart. Our network team is
marking CoS based on the DSCP and enforcing priority. We have three
VLANs on bonded 40 GbE: management, storage (monitors, clients, OSDs),
and cluster (replication). We have three priority classes: management
(heartbeats on all VLANs, SSH, DNS, etc.), storage traffic (no
marking), and replication (scavenger class). We are interested to see
how things pan out.

Thanks,
Robert

On Mon, Mar 9, 2015 at 8:58 PM, Jian Wen  wrote:
> Only OSD calls set_socket_priority().
> See  https://github.com/ceph/ceph/pull/3353
>
> On Tue, Mar 10, 2015 at 3:36 AM, Robert LeBlanc  wrote:
>> I've found commit 9b9a682fe035c985e416ee1c112fa58f9045a27c and I see
>> that when 'osd heartbeat use min delay socket = true' it will mark the
>> packet with DSCP CS6. Based on the setting of the socket in
>> msg/simple/Pipe.cc is it possible that this can apply to both OSD and
>> monitor? I don't understand the code enough to know how the
>> set_socket_options() is called from the OSD and monitor.
>>
>> If this applies to both monitor and OSD, would it be better to rename
>> the option to a more generic name?
>>
>> Thanks,
>>
>> On Sat, Mar 7, 2015 at 4:23 PM, Daniel Swarbrick
>>  wrote:
>>> Judging by the commit, this ought to do the trick:
>>>
>>> osd heartbeat use min delay socket = true
>>>
>>> On 07/03/15 01:20, Robert LeBlanc wrote:
>>>>
>>>> I see that Jian Wen has done work on this for 0.94. I tried looking
>>>> through the code to see if I can figure out how to configure this new
>>>> option, but it all went over my head pretty quick.
>>>>
>>>> Can I get a brief summary on how to set the priority of heartbeat
>>>> packets or where to look in the code to figure it out?
>>>>
>>>> Thanks,
>>>>
>>>> On Thu, Aug 28, 2014 at 2:01 AM, Daniel Swarbrick
>>>> >>> <mailto:daniel.swarbr...@profitbricks.com>> wrote:
>>>>
>>>> On 28/08/14 02:56, Sage Weil wrote:
>>>> > I seem to remember someone telling me there were hooks/hints you
>>>> could
>>>> > call that would tag either a socket or possibly data on that socket
>>>> with a
>>>> > label for use by iptables and such.. but I forget what it was.
>>>> >
>>>>
>>>> Something like setsockopt() SO_MARK?
>>>>
>>>> *SO_MARK *(since Linux 2.6.25)
>>>>Set the mark for each packet sent through this socket
>>>> (similar
>>>>to the netfilter MARK target but socket-based).
>>>> Changing the
>>>>mark can be used for mark-based routing without
>>>> netfilter or
>>>>for packet filtering.  Setting this option requires the
>>>>*CAP_NET_ADMIN *capability.
>>>>
>>>> Alternatively, directly set IP_TOS options on the socket, or
>>>> SO_PRIORITY
>>>> which sets the IP TOS bits as well.
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> <mailto:ceph-users@lists.ceph.com>
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Best,
>
> Jian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
If I remember right, the mon key has to be the same between all the mon
hosts. I don't think I added an admin key to my second mon; it got all the
other keys once it joined the mon cluster. I do remember the join taking a
while. Have you checked the firewall to make sure traffic is allowed? I
don't remember if you said you checked it.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 11, 2015 8:08 PM, "Jesus Chavez (jeschave)" 
wrote:

>  Thanks Steffen I have followed everything not sure what is going on, the
> mon keyring and client admin are individual? Per mon host? Or do I need to
> copy from the first initial mon node?
>
>  Thanks again!
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 11, 2015, at 6:28 PM, Steffen W Sørensen  wrote:
>
>
> On 12/03/2015, at 00.55, Jesus Chavez (jeschave) 
> wrote:
>
> can anybody tell me a good blog link that explain how to add monitor? I
> have tried manually and also with ceph-deploy without success =(
>
> Dunno if these might help U:
>
>
> http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-a-monitor-manual
>
> http://cephnotes.ksperis.com/blog/2013/08/29/mon-failed-to-start
>
> /Steffen
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
Here is the procedure I wrote for our internal use (it is still a work in
progress); it may help you:

*Creating the First Monitor*

Once you have Ceph installed, DNS and networking configured, and a ceph.conf
file built, you are ready to bootstrap the first monitor. The UUID is the
same one from the ceph.conf file generated earlier; cluster-name is the name
of the Ceph cluster, usually just 'ceph'; hostname is the short name of the
host and must match `hostname -s`.

1. SSH into the monitor host
2. Create the monitor keyring

ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon.
--cap mon 'allow *'

3. Create the admin keyring

ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
--gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow
*' --cap mds 'allow'

4. Add the admin key to the monitor keyring so the Admin user can manage
the cluster

ceph-authtool /tmp/ceph.mon.keyring --import-keyring
/etc/ceph/ceph.client.admin.keyring

5. Create the initial monitor map

monmaptool --create --add {hostname} {ip-address} --fsid {uuid} /tmp/monmap

6. Create the directory the monitor will store persistent data

sudo mkdir /var/lib/ceph/mon/{cluster-name}-{hostname}

7. Create the file structure for the monitor

ceph-mon --mkfs -i {hostname} --monmap /tmp/monmap --keyring
/tmp/ceph.mon.keyring

8. Let the monitor know that everything is ready to go

sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/done

9. Start the Monitor service

sudo start ceph-mon id={hostname}

10. Set the Monitor service to start at boot

*Adding Additional Monitors*

Adding additional monitors makes the cluster less susceptible to outages
caused by the hardware running the monitors going offline. As monitors are
added, the load on each monitor can both increase and decrease, so it is
important to understand when a new set of monitors should be added (to
understand why two monitors should be added, please see Monitors). Having
more monitors reduces the number of clients that each monitor has to
service. However, each time the CRUSH map changes (an OSD is marked out, a
new OSD is added, etc.), a majority of monitors have to agree on the change
through the Paxos algorithm and the CRUSH map has to be updated on each
Monitor. If the cluster is experiencing many CRUSH map changes, this adds
additional load on the Monitors.

Monitors perform many file sync operations and are sensitive to latency in
those operations. The large number of file sync operations can be very
disruptive to OSD processes, especially if they reside on the same
traditional rotational disk. It is best to have a Monitor on its own
dedicated hardware. If dedicated hardware is not an option, then locate the
Monitor store on an SSD that is not primarily used for an OSD; sharing an
SSD that holds only journals would be OK.

The process of adding a new monitor is detailed in (
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-a-monitor-manual)
and is outlined as follows:

   1. Copy the Monitor key and Monitor map from a running Monitor to the
   new monitor.
   2. Create a monitor directory on the new monitor.
   3. Add the new monitor to the Monitor map.
   4. Start the new monitor.
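At the command level that works out to roughly the following ({hostname} and
{ip} are placeholders; see the doc linked above for the full details):

# on an existing monitor: export the mon key and the current monmap
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap

# on the new monitor host
sudo mkdir /var/lib/ceph/mon/{cluster-name}-{hostname}
sudo ceph-mon -i {hostname} --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring

# add it to the monitor map and start it
ceph mon add {hostname} {ip}:6789
sudo ceph-mon -i {hostname} --public-addr {ip}:6789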


On Thu, Mar 12, 2015 at 9:58 AM, Jesus Chavez (jeschave)  wrote:

>  Hi Robert yes I did disable completely actually with chkconfig off for
> not take the service up when booting, I have 2 networks 1 with internet for
> yum purposes and the network for the public network so before any
> configuration I specified on ceph.conf that public network but I am not
> sure if it is the cause for something...
> The thing is that I am not sure about which steps should run in the new
> monitor host and which one should I run in the initial monitor, it seems
> like step 3 and 4 that is generate keyring and mapping should be done in
> initial monitor server and also step 5 that is mkfs becuase if I try to run
> those steps in the new monitor host didnt work cause cant find the keys :(
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 12, 2015, at 7:54 AM, Robert LeBlanc  wrote:
>
>   If I remember right, the mon key has to be the same between all the mon
> hosts. I don't think I added an admin key to my second mon, it got all the
> other keys once it joined the mon cluster. I do remember the join taking a
> while. Have you checked the firewall to make sure traffic is allowed? I
> don't remember if you said you checked it.
>
> Robert LeBlanc
>
> Sent from a mobile device please excuse any typos.
> On Mar 11, 2015 8:08 PM, "Jesus Chavez (jeschave)" 
> wrote:
>
>>  Thanks Steffen I have followed everything not sure what is going on,
>

Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
That command (ceph mon add <name> <ip>[:<port>]) can be run from any
client in the cluster with the admin key; it is a general Ceph command.

On Thu, Mar 12, 2015 at 10:33 AM, Jesus Chavez (jeschave) <
jesch...@cisco.com> wrote:

>  Great :) so just 1 point more, step 4 in adding monitors (Add the
> new monitor to the Monitor map.) this command actually runs in the new
> monitor right?
>
>  Thank you so much!
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 12, 2015, at 10:06 AM, Robert LeBlanc  wrote:
>
>  Add the new monitor to the Monitor map.
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
ent. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>
>  On Mar 12, 2015, at 10:33 AM, Jesus Chavez (jeschave) 
> wrote:
>
>  Great :) so just 1 point more, step 4 in adding monitors (Add the
> new monitor to the Monitor map.) this command actually runs in the new
> monitor right?
>
>  Thank you so much!
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 12, 2015, at 10:06 AM, Robert LeBlanc  wrote:
>
>  Add the new monitor to the Monitor map.
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd replication

2015-03-12 Thread Robert LeBlanc
The primary OSD for an object is responsible for the replication. In a
healthy cluster the workflow is as such:

   1. Client looks up primary OSD in CRUSH map
   2. Client sends object to be written to primary OSD
   3. Primary OSD looks up replication OSD(s) in its CRUSH map
   4. Primary OSD contacts replication OSD(s) and sends objects
   5. All OSDs commit object to local journal
   6. Replication OSD(s) report back to primary that the write is committed
   7. Only after the primary OSD receives the acks from the replication
   OSD(s) and the commit to its own local journal does it ack the write to
   the client
   8. Client receives ack and knows that the object is safely stored and
   replicated in the cluster

Ceph has a strong consistency model and will not tell the client the write
is complete until it is replicated in the cluster.
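You can see the mapping for yourself with something like this (the pool and
object names are placeholders):

# shows the PG and the acting set of OSDs (primary listed first) for an object
ceph osd map rbd myobject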

On Thu, Mar 12, 2015 at 12:26 PM, tombo  wrote:

>  Hello,
>
> I need to understand how replication is accomplished or who is taking care
> of replication, osd itsef? Because we are using librados to read/write to
> cluster. If librados is not doing parallel writes according desired number
> of object copies, it could  happen that objects are in journal waiting for
> flush and osd went down so objects are hung in journal? Or do they already
> have their copies on other osds which means that librados is resposible for
> redundancy?
>
> Thanks for explanation.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
I'm not sure why you are having such a hard time. I added monitors (and
removed them) on CentOS 7 by following what I had. The thing that kept
tripping me up was firewalld. Once I either shut it off or created a
service for Ceph, it worked fine.
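If it is firewalld, the openings I am talking about look something like this
(ports per the docs: 6789 for the monitor, 6800-7300 for the OSDs):

firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --permanent --add-port=6800-7300/tcp
firewall-cmd --reload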

What is in /var/log/ceph/ceph-mon.tauro.log when it is hunting for a
monitor?

On Thu, Mar 12, 2015 at 2:31 PM, Jesus Chavez (jeschave)  wrote:

>  Hi Steffen I already had them in my configuration [image: 😞] I am
> stress now because it seems like none of the methods did help :( this is
> bad I think I am going to get back to rhel6.6 where xfs is a damn add on
> and I have to install from centos repo make ceph like patch :( but at last
> with RHEL6.6 work, Shame on RHEL7 the next time I will sell everything with
> ubuntu lol
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 12, 2015, at 1:56 PM, Steffen W Sørensen  wrote:
>
>
>  On 12/03/2015, at 20.00, Jesus Chavez (jeschave) 
> wrote:
>
>  Thats what I thought and did actually the monmap and keyring were copied
> to the new monitor and there with 2 elements I did the mkfs thing and still
> have that Messages, do I need osd configured?  Because I have non and I am
> not sure if it is requiered ... Also is weird that monmap is not taking the
> new monitor I think I should try to configure the 3 monitors as initial
> monitors an see how it goes
>
> Dunno about your config, but I seem to remember when I decommissioned one
> mon instance and addition of a new on another node that I needed to have
> mon. section in ceph.conf inorder to be able to start the monitor.
>
> ceph.conf snippet:
>
>   [osd]
>  osd mount options xfs =
> "rw,noatime,nobarrier,logbsize=256k,logbufs=8,allocsize=4M,attr2,delaylog,inode64,noquota"
>  keyring = /var/lib/ceph/osd/ceph-$id/keyring
>  ; Tuning
>   ;# By default, Ceph makes 3 replicas of objects. If you want to
> make four
>  ;# copies of an object the default value--a primary copy and three
> replica
>  ;# copies--reset the default values as shown in 'osd pool default size'.
>  ;# If you want to allow Ceph to write a lesser number of copies in a
> degraded
>  ;# state, set 'osd pool default min size' to a number less than the
>  ;# 'osd pool default size' value.
>
>  osd pool default size = 2  # Write an object 2 times.
>  osd pool default min size = 1 # Allow writing one copy in a degraded
> state.
>
>  ;# Ensure you have a realistic number of placement groups. We recommend
>  ;# approximately 100 per OSD. E.g., total number of OSDs multiplied by
> 100
>  ;# divided by the number of replicas (i.e., osd pool default size). So
> for
>  ;# 10 OSDs and osd pool default size = 3, we'd recommend approximately
>  ;# (100 * 10) / 3 = 333.
>
>  ;# got 24 OSDs => 1200 pg, but this is not a full production site, so
> let's settle for 1024 to lower cpu load
>  osd pool default pg num = 1024
>  osd pool default pgp num = 1024
>
>  client cache size = 131072
>  osd client op priority = 40
>  osd op threads = 8
>  osd client message size cap = 512
>  filestore min sync interval = 10
>  filestore max sync interval = 60
>  ;filestore queue max bytes = 10485760
>  ;filestore queue max ops = 50
>  ;filestore queue committing max ops = 500
>  ;filestore queue committing max bytes = 104857600
>  ;filestore op threads = 2
>  recovery max active = 2
>  recovery op priority = 30
>  osd max backfills = 2
>  ; Journal Tuning
>  journal size = 5120
>  ;journal max write bytes = 1073714824
>  ;journal max write entries = 1
>  ;journal queue max ops = 5
>  ;journal queue max bytes = 1048576
>
>
>
>
>  [mon.0]
>  host = node4
>  mon addr = 10.0.3.4:6789
>
>  [mon.1]
>  host = node2
>  mon addr = 10.0.3.2:6789
>
>  [mon.2]
>  host = node1
>  mon addr = 10.0.3.1:6789
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Add monitor unsuccesful

2015-03-12 Thread Robert LeBlanc
We all get burned by the firewall at one time or another. Hence the name
'fire'wall! :) I'm glad you got it working.

On Thu, Mar 12, 2015 at 2:53 PM, Jesus Chavez (jeschave)  wrote:

>  This is awkard Robert all this time was the firewall :( I cant believe I
> spent 2 days trying to figure out :(. Thank you so much!
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 12, 2015, at 2:48 PM, Robert LeBlanc  wrote:
>
>   I'm not sure why you are having such a hard time. I added monitors (and
> removed them) on CentOS 7 by following what I had. The thing that kept
> tripping me up was firewalld. Once I either shut it off or created a
> service for Ceph, it worked fine.
>
>  What is in in /var/log/ceph/ceph-mon.tauro.log when it is hunting for a
> monitor?
>
> On Thu, Mar 12, 2015 at 2:31 PM, Jesus Chavez (jeschave) <
> jesch...@cisco.com> wrote:
>
>>  Hi Steffen I already had them in my configuration  I
>> am stress now because it seems like none of the methods did help :( this is
>> bad I think I am going to get back to rhel6.6 where xfs is a damn add on
>> and I have to install from centos repo make ceph like patch :( but at last
>> with RHEL6.6 work, Shame on RHEL7 the next time I will sell everything with
>> ubuntu lol
>>
>>
>>
>> * Jesus Chavez*
>> SYSTEMS ENGINEER-C.SALES
>>
>> jesch...@cisco.com
>> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
>> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>>
>> CCIE - 44433
>>
>> On Mar 12, 2015, at 1:56 PM, Steffen W Sørensen  wrote:
>>
>>
>>  On 12/03/2015, at 20.00, Jesus Chavez (jeschave) 
>> wrote:
>>
>>  Thats what I thought and did actually the monmap and keyring were
>> copied to the new monitor and there with 2 elements I did the mkfs thing
>> and still have that Messages, do I need osd configured?  Because I have non
>> and I am not sure if it is requiered ... Also is weird that monmap is not
>> taking the new monitor I think I should try to configure the 3 monitors as
>> initial monitors an see how it goes
>>
>> Dunno about your config, but I seem to remember when I decommissioned one
>> mon instance and addition of a new on another node that I needed to have
>> mon. section in ceph.conf inorder to be able to start the monitor.
>>
>> ceph.conf snippet:
>>
>>  [osd]
>> osd mount options xfs =
>> "rw,noatime,nobarrier,logbsize=256k,logbufs=8,allocsize=4M,attr2,delaylog,inode64,noquota"
>> keyring = /var/lib/ceph/osd/ceph-$id/keyring
>> ; Tuning
>>  ;# By default, Ceph makes 3 replicas of objects. If you want to
>> make four
>> ;# copies of an object the default value--a primary copy and three
>> replica
>> ;# copies--reset the default values as shown in 'osd pool default size'.
>> ;# If you want to allow Ceph to write a lesser number of copies in a
>> degraded
>> ;# state, set 'osd pool default min size' to a number less than the
>> ;# 'osd pool default size' value.
>>
>>  osd pool default size = 2  # Write an object 2 times.
>> osd pool default min size = 1 # Allow writing one copy in a degraded
>> state.
>>
>>  ;# Ensure you have a realistic number of placement groups. We recommend
>> ;# approximately 100 per OSD. E.g., total number of OSDs multiplied by
>> 100
>> ;# divided by the number of replicas (i.e., osd pool default size). So for
>> ;# 10 OSDs and osd pool default size = 3, we'd recommend approximately
>> ;# (100 * 10) / 3 = 333.
>>
>>  ;# got 24 OSDs => 1200 pg, but this is not a full production site, so
>> let's settle for 1024 to lower cpu load
>> osd pool default pg num = 1024
>> osd pool default pgp num = 1024
>>
>>  client cache size = 131072
>> osd client op priority = 40
>> osd op threads = 8
>> osd client message size cap = 512
>> filestore min sync interval = 10
>> filestore max sync interval = 60
>> ;filestore queue max bytes = 10485760
>> ;filestore queue max ops = 50
>> ;filestore queue committing max ops = 500
>> ;filestore queue committing max bytes = 104857600
>> ;filestore op threads = 2
>> recovery max active = 2
>> recovery op priority = 30
>> osd max backfills = 2
>> ; Journal Tuning
>> journal size = 5120
>> ;journal max write bytes = 1073714824
>> ;journal max write entries = 1
>> ;journal queue max ops = 5
>> ;journal queue max bytes = 1048576
>>
>>
>>
>>
>>  [mon.0]
>> host = node4
>> mon addr = 10.0.3.4:6789
>>
>>  [mon.1]
>> host = node2
>> mon addr = 10.0.3.2:6789
>>
>>  [mon.2]
>> host = node1
>> mon addr = 10.0.3.1:6789
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange Monitor Appearance after Update

2015-03-12 Thread Robert LeBlanc
Two monitors don't work very well and really don't buy you anything. I
would either add another monitor or remove one. Paxos is most effective
with an odd number of monitors.

I don't know about the problem you are experiencing or how to help you,
though. An even number of monitors should still work.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 12, 2015 7:19 PM, "Georgios Dimitrakakis" 
wrote:

> I forgot to say that the monitors form a quorum and the cluster's health
> is OK
> so there aren't any serious troubles other than the annoying message.
>
> Best,
>
> George
>
>  Hi all!
>>
>> I have updated from 0.80.8 to 0.80.9 and every time I try to restart
>> CEPH a monitor a strange monitor is appearing!
>>
>> Here is the output:
>>
>>
>> #/etc/init.d/ceph restart mon
>> === mon.master ===
>> === mon.master ===
>> Stopping Ceph mon.master on master...kill 10766...done
>> === mon.master ===
>> Starting Ceph mon.master on master...
>> Starting ceph-create-keys on master...
>> === mon.master_192.168.0.10 ===
>> === mon.master_192.168.0.10 ===
>> Stopping Ceph mon.master_192.168.0.10 on master...done
>> === mon.master_192.168.0.10 ===
>> Starting Ceph mon.master_192.168.0.10 on master...
>> 2015-03-13 03:06:22.964493 7f06256fa7a0 -1
>> mon.master_192.168.0.10@-1(probing) e2 not in monmap and have been in
>> a quorum before; must have been removed
>> 2015-03-13 03:06:22.964497 7f06256fa7a0 -1
>> mon.master_192.168.0.10@-1(probing) e2 commit suicide!
>> 2015-03-13 03:06:22.964499 7f06256fa7a0 -1 failed to initialize
>> failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i master_192.168.0.10
>> --pid-file /var/run/ceph/mon.master_192.168.0.10.pid -c
>> /etc/ceph/ceph.conf --cluster ceph '
>>
>>
>> I have two monitors which are:
>>
>> mon.master and mon.client1
>>
>> and have defined them in ceph.conf as:
>>
>> mon_initial_members = master,client1
>> mon_host = 192.168.0.10,192.168.0.11
>>
>>
>>
>> Why is the "mon.master_192.168.0.10" appearing and how can I stop it
>> from happening?
>>
>>
>> The above is the problem on one node. Obviously the problem is
>> appearing on the other node as well but instead I have
>>
>> "mon.client1_192.168.0.11" appearing
>>
>>
>>
>> Any ideas?
>>
>>
>> Regards,
>>
>>
>> George
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange Monitor Appearance after Update

2015-03-12 Thread Robert LeBlanc
As you say, having two monitors should not be causing the problem you are
seeing. What is in /var/log/ceph/ceph.mon.*.log?

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 12, 2015 7:39 PM, "Georgios Dimitrakakis" 
wrote:

> Hi Robert!
>
> Thanks for the feedback! I am aware of the fact that the number of the
> monitors should be odd
> but this is a very basic setup just to test CEPH functionality and perform
> tasks there before
> doing it to our production cluster.
>
> So I am not concerned about that and I really don't believe that this is
> why the problem has appeared!
>
> What concerns me is how this "new" monitor that has the same name followed
> by an underscore and
> the IP address appeared out of nowhere and how to stop it!
>
> Regards,
>
> George
>
>  Two monitors dont work very well and really dont but you anything. I
>> would either add another monitor or remove one. Paxos is most
>> effective with an odd number of monitors.
>>
>> I dont know about the problem you are experiencing and how to help
>> you. An even number of monitors should work.
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device please excuse any typos.
>> On Mar 12, 2015 7:19 PM, "Georgios Dimitrakakis"  wrote:
>>
>>  I forgot to say that the monitors form a quorum and the clusters
>>> health is OK
>>> so there arent any serious troubles other than the annoying
>>> message.
>>>
>>> Best,
>>>
>>> George
>>>
>>>  Hi all!
>>>>
>>>> I have updated from 0.80.8 to 0.80.9 and every time I try to
>>>> restart
>>>> CEPH a monitor a strange monitor is appearing!
>>>>
>>>> Here is the output:
>>>>
>>>> #/etc/init.d/ceph restart mon
>>>> === mon.master ===
>>>> === mon.master ===
>>>> Stopping Ceph mon.master on master...kill 10766...done
>>>> === mon.master ===
>>>> Starting Ceph mon.master on master...
>>>> Starting ceph-create-keys on master...
>>>> === mon.master_192.168.0.10 ===
>>>> === mon.master_192.168.0.10 ===
>>>> Stopping Ceph mon.master_192.168.0.10 on master...done
>>>> === mon.master_192.168.0.10 ===
>>>> Starting Ceph mon.master_192.168.0.10 on master...
>>>> 2015-03-13 03:06:22.964493 7f06256fa7a0 -1
>>>> mon.master_192.168.0.10@-1(probing) e2 not in monmap and have
>>>> been in
>>>> a quorum before; must have been removed
>>>> 2015-03-13 03:06:22.964497 7f06256fa7a0 -1
>>>> mon.master_192.168.0.10@-1(probing) e2 commit suicide!
>>>> 2015-03-13 03:06:22.964499 7f06256fa7a0 -1 failed to initialize
>>>> failed: ulimit -n 32768;  /usr/bin/ceph-mon -i
>>>> master_192.168.0.10
>>>> --pid-file /var/run/ceph/mon.master_192.168.0.10.pid -c
>>>> /etc/ceph/ceph.conf --cluster ceph
>>>>
>>>> I have two monitors which are:
>>>>
>>>> mon.master and mon.client1
>>>>
>>>> and have defined them in ceph.conf as:
>>>>
>>>> mon_initial_members = master,client1
>>>> mon_host = 192.168.0.10,192.168.0.11
>>>>
>>>> Why is the "mon.master_192.168.0.10" appearing and how can I stop
>>>> it
>>>> from happening?
>>>>
>>>> The above is the problem on one node. Obviously the problem is
>>>> appearing on the other node as well but instead I have
>>>>
>>>> "mon.client1_192.168.0.11" appearing
>>>>
>>>> Any ideas?
>>>>
>>>> Regards,
>>>>
>>>> George
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com [1]
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2]
>>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com [3]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
>>>
>>
>>
>> Links:
>> --
>> [1] mailto:ceph-users@lists.ceph.com
>> [2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> [3] mailto:ceph-users@lists.ceph.com
>> [4] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> [5] mailto:gior...@acmac.uoc.gr
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD booting down

2015-03-12 Thread Robert LeBlanc
My experience with CentOS 7 is that ceph-disk works the best. Systemd has a
fit with extra arguments common in the upstart and SysV scripts. Ceph
installs udev rules that will automatically mount and start OSDs.

The udev rules look for the GPT partition type GUIDs reserved for Ceph to
find partitions that should be mounted and started. You can do it by hand
(I've done it to understand the process) but it is a lot of work. Since
we've gone to using ceph-disk we haven't had any problems with OSDs
starting at boot. If I need to restart an OSD, I just kill the process and
then run ceph-disk activate. Ceph-disk is just a script so you can open it
up and take a look.

So I guess it depends on which kind of "automatically" you want to happen.
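For reference, the ceph-disk flow I am describing is roughly this (the
device names are just examples):

ceph-disk prepare /dev/sdb       # partitions the disk and tags it with the Ceph GPT type codes
ceph-disk activate /dev/sdb1     # mounts the data partition and starts the OSD
# at boot, the udev rules run the activate step automatically for tagged partitions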

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 12, 2015 9:54 PM, "Jesus Chavez (jeschave)" 
wrote:

>  Hi all, after adding osds manually and reboot the server the osd didnt
> come up automatically am I missing something?
>
>  Thanks
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication question

2015-03-13 Thread Robert LeBlanc
That is correct; you make a tradeoff between space, performance and
resiliency. By reducing replication from 3 to 2, you will get more space
and likely more performance (less overhead from the third copy), but it
comes at the expense of being able to recover your data when there are
multiple failures.
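Roughly, keeping the default 0.85 full ratio in mind, your numbers are in
the right ballpark:

145 TB raw / 3 replicas ~= 48 TB, x 0.85 ~= 41 TB usable
145 TB raw / 2 replicas ~= 72 TB, x 0.85 ~= 61-62 TB usable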

On Fri, Mar 13, 2015 at 4:13 AM, Thomas Foster 
wrote:

> Ok, now if I run a lab and the data is somewhat important but I can bare
> losing the data, couldn't I shrink the pool replica count and that
> increases the amount of storage I can use without using erasure coding?
>
> So for 145TB with a replica of 3 = ~41 TB total in the cluster
>
> But if that same clusters replica was decreased to 2 I could possibly get
> 145TB / 2 - overhead for cluster and get ~65TB in the cluster at one
> time..correct?
>
> Thanks in advance!
>
>
> On Mar 12, 2015 11:53 AM, "Kamil Kuramshin" 
> wrote:
>
>>  For example, here is my confuguration:
>>
>> superuser@admin:~$ ceph df
>> GLOBAL:
>> SIZE AVAIL RAW USED %RAW USED
>> 242T  209T   20783G  8.38
>> POOLS:
>> NAME  ID USED  %USED MAX AVAIL
>> OBJECTS
>> ec_backup-storage 4  9629G  3.88  137T
>> 2465171
>> cache 5   136G  0.0638393M
>> 35036
>> block-devices 6  1953G  0.7970202G
>> 500060
>>
>>
>> *ec_backup-storage* - is Erasure Encoded pool, k=2, m=1 (default)
>> *cache* - is replicated pool consisting dedicated 12xSSDx60Gb disks,
>> replica size=3, used as cache tier for EC pool
>> *block-devices* - is replicated pool, replica size=3, using same OSD's
>> that in Erasure Encoded pool
>>
>> On* '**MAX AVAIL**'* column you can see that EC pool currently has
>> *137Tb* of free space, but in same time if we will write to replicated
>> pool there is only *70Tb, *but *both* pools are on the *same* *OSD's. *So
>> using EC pool saves 2 times more effective space in my case!
>>
>> 12.03.2015 17:50, Thomas Foster пишет:
>>
>> Thank you!  That helps alot.
>> On Mar 12, 2015 10:40 AM, "Steve Anthony"  wrote:
>>
>>>  Actually, it's more like 41TB. It's a bad idea to run at near full
>>> capacity (by default past 85%) because you need some space where Ceph can
>>> replicate data as part of its healing process in the event of disk or node
>>> failure. You'll get a health warning when you exceed this ratio.
>>>
>>> You can use erasure coding to increase the amount of data you can store
>>> beyond 41TB, but you'll still need some replicated disk as a caching layer
>>> in front of the erasure coded pool if you're using RBD. See:
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/036430.html
>>>
>>> As to how much space you can save with erasure coding, that will depend
>>> on if you're using RBD and need a cache layer and the values you set for k
>>> and m (number of data chunks and coding chunks). There's been some
>>> discussion on the list with regards to choosing those values.
>>>
>>> -Steve
>>>
>>> On 03/12/2015 10:07 AM, Thomas Foster wrote:
>>>
>>> I am looking into how I can maximize my space with replication, and I am
>>> trying to understand how I can do that.
>>>
>>>  I have 145TB of space and a replication of 3 for the pool and was
>>> thinking that the max data I can have in the cluster is ~47TB in my cluster
>>> at one time..is that correct?  Or is there a way to get more data into the
>>> cluster with less space using erasure coding?
>>>
>>>  Any help would be greatly appreciated.
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing 
>>> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> --
>>> Steve Anthony
>>> LTS HPC Support Specialist
>>> Lehigh universitysma...@lehigh.edu
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> ___
>> ceph-users mailing 
>> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Infiniband CLUS & PUB Network

2015-03-17 Thread Robert LeBlanc
We have a test cluster with IB. We have both networks over IPoIB on the
same IP subnet though (no cluster network configuration).

On Tue, Mar 17, 2015 at 12:02 PM, German Anders 
wrote:

>  Hi All,
>
>   Does anyone have Ceph implemented with Infiniband for Cluster and
> Public network?
>
> Thanks in advance,
>
>
>
> *German Anders*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Infiniband CLUS & PUB Network

2015-03-17 Thread Robert LeBlanc
I don't know what the disk performance in your OSD nodes is, but dual FDR
would probably be more than enough that I wouldn't worry about doing a
separate cluster network. The FDR card should have more capacity than your
PCI bus anyway.

Since Ceph does not use RDMA or native IB verbs yet, you won't see a ton of
improvement over your Ethernet implementation. Mellanox and others are
working to have Ceph use Accelio, which will provide RDMA messaging for
Ceph, but there is still quite a bit of work to get it fully supported.

I haven't used the SX6036G, so I don't know what to expect or offer you in
that regard, but it may determine if you use the Mellanox driver set or if
you can use the OFED drivers (packaged with your distro). I personally like
using the distro packages even if they are a little more out of date, but
to support the gateway, you might need something newer.

We use connected mode on our cluster with MTU of 65520, but some people
have better performance with datagram mode and MTU of 4096. I think if you
are using RDMA, then datagram mode does perform better. You will need to
test it in your environment and with the drivers you choose to use.

As far as Ceph goes, currently you would configure it the same as Ethernet
with IPoIB. You just have to make sure you have a subnet manager up on the
IB fabric so that you can bring up the IPoIB interfaces. For some reason
CentOS 7 doesn't bring up opensm automatically, so I have to remember to
start it when I reboot one of the boxes; luckily it can be clustered very
easily.
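A rough sketch of the IPoIB side of that (the ib0 interface name and the
opensm service name are assumptions for your setup):

# connected mode with a large MTU (or 'datagram' with MTU 4096)
echo connected | sudo tee /sys/class/net/ib0/mode
sudo ip link set ib0 mtu 65520

# make sure a subnet manager is running somewhere on the fabric
sudo systemctl enable opensm
sudo systemctl start opensm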

I recommend searching through the mailing list archives, some tweaks for IB
have been given for memory contention scenarios (this will be important if
your OSDs don't have a lot of RAM since IB locks memory regions). Keep a
look out for progress on XIO on the mailing list to see when native IB
support will be in Ceph.

On Tue, Mar 17, 2015 at 12:13 PM, German Anders 
wrote:

>  Hi Robert,
>
>   How are you? Thanks a lot for the quick response. I would like to
> know if you could share some info on this. We have an existing Ceph cluster
> in production, with the following:
>
> 3 x MON servers with 10GbE ADPT DP (one port on the PUB network)
> 4 x OSD servers with 10GbE ADPT DP (one port on the PUB network and one
> port on the CLUS network)
> We are using Juniper EX4500 at 10GbE
>
> This cluster works fine but we need to improve performance on the network
> side, and we are going for other reason too, to implement Infiniband FDR
> 56Gb/s, I've spoke with Mellanox people and we are going to use (2) SX6036G
> gateway switches, so we can connect our existing ETH network (2 x 40G
> connections on each gateway), the gateways are going to be in an
> active/active configuration, also we are going to put one ADPT FDR DP on
> each node.
>
>  And then we are going to have two more switches SX6018F so use for CLUS
> Ceph network, and the PUB network of the Ceph nodes is going to be
> connected directly to the IB gateways switches.
> What considerations do we need to have in order to use IB with Ceph?,
> also, do you have any procedure to configure IB with Ceph?, like
> dependencies to install on the hosts, etc.
>
> Any help will really be appreciated.
>
> Thanks in advance,
>
>
>
> *German Anders*
>
> Storage System Engineer Leader
>
> *Despegar* | IT Team
>
> *office* +54 11 4894 3500 x3408
>
> *mobile* +54 911 3493 7262
> *mail* gand...@despegar.com
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --- Original message ---
> *Asunto:* Re: [ceph-users] Ceph + Infiniband CLUS & PUB Network
> *De:* Robert LeBlanc 
> *Para:* German Anders 
> *Cc:* ceph-users@lists.ceph.com 
> *Fecha:* Tuesday, 17/03/2015 15:06
>
> We have a test cluster with IB. We have both networks over IPoIB on the
> same IP subnet though (no cluster network configuration).
>
> On Tue, Mar 17, 2015 at 12:02 PM, German Anders 
> wrote:
>
>> Hi All,
>>
>>   Does anyone have Ceph implemented with Infiniband for Cluster and
>> Public network?
>>
>> Thanks in advance,
>>
>>
>>
>> *German Anders*
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping OSD to physical device

2015-03-19 Thread Robert LeBlanc
Udev already provides some of this for you. Look in /dev/disk/by-*.
You can reference drives by UUID, id or path (for
SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across
reboots and hardware changes.

On Thu, Mar 19, 2015 at 1:10 PM, Colin Corr  wrote:
> Greetings Cephers,
>
> I have been lurking on this list for a while, but this is my first inquiry. I 
> have been playing with Ceph for the past 9 months and am in the process of 
> deploying a production Ceph cluster. I am seeking advice on an issue that I 
> have encountered. I do not believe it is a Ceph specific issue, but more of a 
> Linux issue. Technically, its not an issue, just undesired behaviour that I 
> am hoping someone here has encountered and can provide some insight as to a 
> work around.
>
> Basically, there are occasions when an OSD host machine gets rebooted. 
> Sometimes one or more drives does not spin up properly. This causes the OSD 
> to go offline, along with all other OSDs after it in sequence.
>
> I created my OSDs using the online docs with the Linux device name (ex. 
> /dev/sdc, sdd, sde, etc). So, osd.0 = /dev/sdc, osd.1 = /dev/sdd, osd.2 = 
> /dev/sde, osd.3 = dev/sdf, etc.
>
> But, if one of the drives fails/does not spin up, then Linux will rename the 
> drives. Example, /dev/sdd fails on reboot, so now osd.1 comes up with 
> /dev/sde, but /dev/sde is actually the osd.2 drive and osd.2 comes up with 
> what was the osd.3 drive, then they all fall offline in sequence after the 
> one failed osd.1.
>
> As expected, if I replace the failed drive and reboot, Linux enumerates the 
> drives and gives them the original device names and Ceph behaves properly by 
> marking the affected osd as down and out, while the remaining drives in 
> sequence come up and recover gracefully.
>
> Does anyone have any thoughts or experience with how one can ensure that 
> Linux device names will always map to the physical device ID? I was thinking 
> along the lines of a udev ruleset for the drives or something similar. Or, is 
> there a better way to create the OSD using the physical device ID? Basically, 
> some sort of way to ensure that a specific physical drive always gets mapped 
> to the same device name and OSD.
>
> Thanks for any insight or thoughts on this,
>
> Colin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping OSD to physical device

2015-03-19 Thread Robert LeBlanc
I don't use ceph-deploy, but creating the OSDs with ceph-disk
automatically uses the by-partuuid reference for the journals (at
least I recall only passing /dev/sdX for the journal reference, which
is what I have in my documentation). Since ceph-disk does all the
partitioning, it automatically finds the volume with udev, mounts it
in the correct location and accesses the journal on the right disk.

It may also be a limitation of the version of ceph-deploy/ceph-disk
you are using.

On Thu, Mar 19, 2015 at 5:54 PM, Colin Corr  wrote:
> On 03/19/2015 12:27 PM, Robert LeBlanc wrote:
>> Udev already provides some of this for you. Look in /dev/disk/by-*.
>> You can reference drives by UUID, id or path (for
>> SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across
>> reboots and hardware changes.
>
> Thanks for the quick responses. And to Kobi (off list) as well.
>
> It seems the optimal way to do this is to create the OSDs by ID in the first 
> place.
>
> So, for /dev/sde with a journal on /dev/sda5:
>
> root@osd1:~$ ls -l /dev/disk/by-id/ | grep sde
> lrwxrwxrwx 1 root root  9 Mar 19 23:36 
> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5TUCJX9 -> ../../sde
> lrwxrwxrwx 1 root root  9 Mar 19 23:36 wwn-0x50014ee20a66aefe -> ../../sde
>
> root@osd1:~$ ls -l /dev/disk/by-id/ | grep sda5
> lrwxrwxrwx 1 root root 10 Mar 19 23:36 
> ata-Crucial_CT480M500SSD1_14210C292B50-part5 -> ../../sda5
> lrwxrwxrwx 1 root root 10 Mar 19 23:36 wwn-0x500a07510c292b50-part5 -> 
> ../../sda5
>
> The deploy command looks like this:
>
> ceph-deploy --overwrite-conf osd create 
> osd1:/dev/disk/by-id/wwn-0x50014ee20a66aefe:/dev/disk/by-id/wwn-0x500a07510c292b50-part5
>
> And alternatively, create a udev rule set for existing devices.
>
> I haven't tested yet, but I am guessing that the udev rule for that same disk 
> (deployed as sde) would look something like this:
>
> KERNEL=="sde", SUBSYSTEM=="block", 
> DEVLINKS=="/dev/disk/by-id/wwn-0x50014ee20a66aefe"
>
>
> Many thanks for the assistance!
>
> Colin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Server Specific Pools

2015-03-20 Thread Robert LeBlanc
You can create CRUSH rulesets and then assign pools to different rulesets.

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds
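
As a rough sketch (the root, rule, and pool names below are made up;
adjust them to match your CRUSH map):

# rule that only chooses OSDs under a root bucket named "x86"
ceph osd crush rule create-simple x86_ruleset x86 host
# find the new ruleset id, then point a pool at it
ceph osd crush rule dump
ceph osd pool set x86test crush_ruleset 1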

On Thu, Mar 19, 2015 at 7:28 PM, Garg, Pankaj
 wrote:
> Hi,
>
>
>
> I have a Ceph cluster with both ARM and x86 based servers in the same
> cluster. Is there a way for me to define Pools or some logical separation
> that would allow me to use only 1 set of machines for a particular test.
>
> That way it makes easy for me to run tests either on x86 or ARM and do some
> comparison testing.
>
>
>
> Thanks
>
> Pankaj
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: OSD Forece Removal

2015-03-20 Thread Robert LeBlanc
Removing the OSD from the CRUSH map and deleting the auth key is how you
force remove an OSD. The OSD can no longer participate in the cluster, even
if it does come back to life. All clients forget about the OSD when the new
CRUSH map is distributed.
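
The sequence we use looks roughly like this (osd.9 is just an example id):

ceph osd crush remove osd.9
ceph auth del osd.9
ceph osd rm osd.9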

On Fri, Mar 20, 2015 at 11:19 AM, Jesus Chavez (jeschave) <
jesch...@cisco.com> wrote:

>  Any idea how to forcé remove ? Thanks
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> Begin forwarded message:
>
>  *From:* Stéphane DUGRAVOT 
> *Date:* March 20, 2015 at 3:49:11 AM CST
> *To:* "Jesus Chavez (jeschave)" 
> *Cc:* ceph-users 
> *Subject:* *Re: [ceph-users] OSD Forece Removal*
>
>
>
>  --
>
> Hi all, can anybody tell me how can I force delete  osds? the thing is
> that one node got corrupted because of outage, so there is no way to get
> those osd up and back, is there anyway to force the removal from
> ceph-deploy node?
>
>
>  Hi,
>  Try manual :
>
>-
>
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
>
>
>  Thanks
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <%2B52%2055%205267%203146>*
> Mobile: *+51 1 5538883255*
>
> CCIE - 44433
>
>
> Cisco.com 
>
>
>
>
>
>   Think before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
>  for
> Company Registration Information.
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-20 Thread Robert LeBlanc
We tested bcache and abandoned it for two reasons.

   1. It didn't give us any better performance than journals on SSD.
   2. We had lots of corruption of the OSDs and were rebuilding them
   frequently.

Since removing bcache, the OSDs have been much more stable.

On Fri, Mar 20, 2015 at 4:03 AM, Nick Fisk  wrote:

>
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Burkhard Linke
> > Sent: 20 March 2015 09:09
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
> >
> > Hi,
> >
> > On 03/19/2015 10:41 PM, Nick Fisk wrote:
> > > I'm looking at trialling OSD's with a small flashcache device over
> > > them to hopefully reduce the impact of metadata updates when doing
> > small block io.
> > > Inspiration from here:-
> > >
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083
> > >
> > > One thing I suspect will happen, is that when the OSD node starts up
> > > udev could possibly mount the base OSD partition instead of
> > > flashcached device, as the base disk will have the ceph partition uuid
> > > type. This could result in quite nasty corruption.
> > I ran into this problem with an enhanceio based cache for one of our
> > database servers.
> >
> > I think you can prevent this problem by using bcache, which is also
> integrated
> > into the official kernel tree. It does not act as a drop in replacement,
> but
> > creates a new device that is only available if the cache is initialized
> correctly. A
> > GPT partion table on the bcache device should be enough to allow the
> > standard udev rules to kick in.
> >
> > I haven't used bcache in this scenario yet, and I cannot comment on its
> speed
> > and reliability compared to other solutions. But from the operational
> point of
> > view it is "safer" than enhanceio/flashcache.
>
> I did look at bcache, but there are a lot of worrying messages on the
> mailing list about hangs and panics that has discouraged me slightly from
> it. I do think it is probably the best solution, but I'm not convinced
> about
> the stability.
>
> >
> > Best regards,
> > Burkhard
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs issue

2015-03-20 Thread Robert LeBlanc
The weight can be based on anything: size, speed, capability, some random
value, etc. The important thing is that it makes sense to you and that you
are consistent.

By default Ceph (ceph-disk, and I believe ceph-deploy) takes the approach of
using size. So if you use a different weighting scheme, you should manually
add the OSDs, or "clean up" after using ceph-disk/ceph-deploy. Size works
well for most people unless the disks are smaller than 10 GB, so most people
don't bother messing with it.

On Fri, Mar 20, 2015 at 12:06 PM, Bogdan SOLGA 
wrote:

> Thank you for the clarifications, Sahana!
>
> I haven't got to that part, yet, so these details were (yet) unknown to
> me. Perhaps some information on the PGs weight should be provided in the
> 'quick deployment' page, as this issue might be encountered in the future
> by other users, as well.
>
> Kind regards,
> Bogdan
>
>
> On Fri, Mar 20, 2015 at 12:05 PM, Sahana  wrote:
>
>> Hi Bogdan,
>>
>>  Here is the link for hardware recccomendations :
>> http://ceph.com/docs/master/start/hardware-recommendations/#hard-disk-drives.
>> As per this link, minimum  size  reccommended for osds  is 1TB.
>>  Butt as Nick said, Ceph OSDs must be min. 10GB to get an weight of 0.01
>> Here is the snippet from crushmaps section of ceph docs:
>>
>> Weighting Bucket Items
>>
>> Ceph expresses bucket weights as doubles, which allows for fine
>> weighting. A weight is the relative difference between device capacities.
>> We recommend using 1.00 as the relative weight for a 1TB storage device.
>> In such a scenario, a weight of 0.5 would represent approximately 500GB,
>> and a weight of 3.00 would represent approximately 3TB. Higher level
>> buckets have a weight that is the sum total of the leaf items aggregated by
>> the bucket.
>>
>> Thanks
>>
>> Sahana
>>
>> On Fri, Mar 20, 2015 at 2:08 PM, Bogdan SOLGA 
>> wrote:
>>
>>> Thank you for your suggestion, Nick! I have re-weighted the OSDs and the
>>> status has changed to '256 active+clean'.
>>>
>>> Is this information clearly stated in the documentation, and I have
>>> missed it? In case it isn't - I think it would be recommended to add it, as
>>> the issue might be encountered by other users, as well.
>>>
>>> Kind regards,
>>> Bogdan
>>>
>>>
>>> On Fri, Mar 20, 2015 at 10:33 AM, Nick Fisk  wrote:
>>>
 I see the Problem, as your OSD's are only 8GB they have a zero weight,
 I think the minimum size you can get away with is 10GB in Ceph as the size
 is measured in TB and only has 2 decimal places.

 For a work around try running :-

 ceph osd crush reweight osd.X 1

 for each osd, this will reweight the OSD's. Assuming this is a test
 cluster and you won't be adding any larger OSD's in the future this
 shouldn't cause any problems.

 >
 > admin@cp-admin:~/safedrive$ ceph osd tree
 > # idweighttype nameup/downreweight
 > -10root default
 > -20host osd-001
 > 00osd.0up1
 > 10osd.1up1
 > -30host osd-002
 > 20osd.2up1
 > 30osd.3up1
 > -40host osd-003
 > 40osd.4up1
 > 50osd.5up1





>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs issue

2015-03-20 Thread Robert LeBlanc
I like this idea. I was under the impression that udev did not call the
init script, but called ceph-disk directly. I don't see where ceph-disk
calls create-or-move, but I know it does because I see it in ceph -w when I
boot up OSDs.

/lib/udev/rules.d/95-ceph-osd.rules
# activate ceph-tagged partitions
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  RUN+="/usr/sbin/ceph-disk-activate /dev/$name"
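
For reference, the create-or-move that gets run on activation looks
something like this (the id, weight and location are made-up examples):

ceph osd crush create-or-move -- 0 0.05 host=nodev root=default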


On Fri, Mar 20, 2015 at 2:36 PM, Craig Lewis 
wrote:

> This seems to be a fairly consistent problem for new users.
>
>  The create-or-move is adjusting the crush weight, not the osd weight.
> Perhaps the init script should set the defaultweight to 0.01 if it's <= 0?
>
> It seems like there's a downside to this, but I don't see it.
>
>
>
>
> On Fri, Mar 20, 2015 at 1:25 PM, Robert LeBlanc 
> wrote:
>
>> The weight can be based on anything, size, speed, capability, some random
>> value, etc. The important thing is that it makes sense to you and that you
>> are consistent.
>>
>> Ceph by default (ceph-disk and I believe ceph-deploy) take the approach
>> of using size. So if you use a different weighting scheme, you should
>> manually add the OSDs, or "clean up" after using ceph-disk/ceph-deploy.
>> Size works well for most people, unless the disks are less than 10 GB so
>> most people don't bother messing with it.
>>
>> On Fri, Mar 20, 2015 at 12:06 PM, Bogdan SOLGA 
>> wrote:
>>
>>> Thank you for the clarifications, Sahana!
>>>
>>> I haven't got to that part, yet, so these details were (yet) unknown to
>>> me. Perhaps some information on the PGs weight should be provided in the
>>> 'quick deployment' page, as this issue might be encountered in the future
>>> by other users, as well.
>>>
>>> Kind regards,
>>> Bogdan
>>>
>>>
>>> On Fri, Mar 20, 2015 at 12:05 PM, Sahana  wrote:
>>>
>>>> Hi Bogdan,
>>>>
>>>>  Here is the link for hardware recccomendations :
>>>> http://ceph.com/docs/master/start/hardware-recommendations/#hard-disk-drives.
>>>> As per this link, minimum  size  reccommended for osds  is 1TB.
>>>>  Butt as Nick said, Ceph OSDs must be min. 10GB to get an weight of
>>>> 0.01
>>>> Here is the snippet from crushmaps section of ceph docs:
>>>>
>>>> Weighting Bucket Items
>>>>
>>>> Ceph expresses bucket weights as doubles, which allows for fine
>>>> weighting. A weight is the relative difference between device capacities.
>>>> We recommend using 1.00 as the relative weight for a 1TB storage
>>>> device. In such a scenario, a weight of 0.5 would represent
>>>> approximately 500GB, and a weight of 3.00 would represent
>>>> approximately 3TB. Higher level buckets have a weight that is the sum total
>>>> of the leaf items aggregated by the bucket.
>>>>
>>>> Thanks
>>>>
>>>> Sahana
>>>>
>>>> On Fri, Mar 20, 2015 at 2:08 PM, Bogdan SOLGA 
>>>> wrote:
>>>>
>>>>> Thank you for your suggestion, Nick! I have re-weighted the OSDs and
>>>>> the status has changed to '256 active+clean'.
>>>>>
>>>>> Is this information clearly stated in the documentation, and I have
>>>>> missed it? In case it isn't - I think it would be recommended to add it, 
>>>>> as
>>>>> the issue might be encountered by other users, as well.
>>>>>
>>>>> Kind regards,
>>>>> Bogdan
>>>>>
>>>>>
>>>>> On Fri, Mar 20, 2015 at 10:33 AM, Nick Fisk  wrote:
>>>>>
>>>>>> I see the Problem, as your OSD's are only 8GB they have a zero
>>>>>> weight, I think the minimum size you can get away with is 10GB in Ceph as
>>>>>> the size is measured in TB and only has 2 decimal places.
>>>>>>
>>>>>> For a work around try running :-
>>>>>>
>>>>>> ceph osd crush reweight osd.X 1
>>>>>>
>>>>>> for each osd, this will reweight the OSD's. Assuming this is a test
>>>>>> cluster and you won't be adding any larger OSD's in the future this
>>>>>> shouldn't cause any problems.
>>>>>&

Re: [ceph-users] Fwd: OSD Forece Removal

2015-03-20 Thread Robert LeBlanc
Does it show DNE in the entry? That stands for Does Not Exist. It will
disappear on its own after a while. I don't know what the timeout is, but
they have always gone away within 24 hours. I've edited the CRUSH map
before, and I don't think it removed the entry when it was already DNE; I
just had to wait for it to go away on its own.

On Fri, Mar 20, 2015 at 3:55 PM, Jesus Chavez (jeschave)  wrote:

>  Maybe I should Edit the crushmap and delete osd... Is that a way yo
> force them?
>
>  Thanks
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>
> CCIE - 44433
>
> On Mar 20, 2015, at 2:21 PM, Robert LeBlanc  wrote:
>
>   Removing the OSD from the CRUSH map and deleting the auth key is how
> you force remove an OSD. The OSD can no longer participate in the cluster,
> even if it does come back to life. All clients forget about the OSD when
> the new CRUSH map is distributed.
>
> On Fri, Mar 20, 2015 at 11:19 AM, Jesus Chavez (jeschave) <
> jesch...@cisco.com> wrote:
>
>>  Any idea how to forcé remove ? Thanks
>>
>>
>> * Jesus Chavez*
>> SYSTEMS ENGINEER-C.SALES
>>
>> jesch...@cisco.com
>> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
>> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>>
>> CCIE - 44433
>>
>> Begin forwarded message:
>>
>>  *From:* Stéphane DUGRAVOT 
>> *Date:* March 20, 2015 at 3:49:11 AM CST
>> *To:* "Jesus Chavez (jeschave)" 
>> *Cc:* ceph-users 
>> *Subject:* *Re: [ceph-users] OSD Forece Removal*
>>
>>
>>
>>  --
>>
>> Hi all, can anybody tell me how can I force delete  osds? the thing is
>> that one node got corrupted because of outage, so there is no way to get
>> those osd up and back, is there anyway to force the removal from
>> ceph-deploy node?
>>
>>
>>  Hi,
>>  Try manual :
>>
>>-
>>
>> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
>>
>>
>>  Thanks
>>
>>
>> * Jesus Chavez*
>> SYSTEMS ENGINEER-C.SALES
>>
>> jesch...@cisco.com
>> Phone: *+52 55 5267 3146 <%2B52%2055%205267%203146>*
>> Mobile: *+51 1 5538883255*
>>
>> CCIE - 44433
>>
>>
>> Cisco.com <http://www.cisco.com/>
>>
>>
>>
>>
>>
>>   Think before you print.
>>
>> This email may contain confidential and privileged material for the sole
>> use of the intended recipient. Any review, use, distribution or disclosure
>> by others is strictly prohibited. If you are not the intended recipient (or
>> authorized to receive for the recipient), please contact the sender by
>> reply email and delete all copies of this message.
>>
>> Please click here
>> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
>> Company Registration Information.
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Forece Removal

2015-03-20 Thread Robert LeBlanc
Yes, at this point I'd export the CRUSH map, edit it, and import it back in.
What version are you running?
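
The rough sequence (file names are arbitrary):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# remove the dead osd entries from crush.txt, then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new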

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 20, 2015 4:28 PM, "Jesus Chavez (jeschave)" 
wrote:

>  thats what you sayd?
>
>  [root@capricornio ~]# ceph auth del osd.9
> entity osd.9 does not exist
> [root@capricornio ~]# ceph auth del osd.19
> entity osd.19 does not exist
>
>
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <%2B52%2055%205267%203146>*
> Mobile: *+51 1 5538883255*
>
> CCIE - 44433
>
>
> Cisco.com <http://www.cisco.com/>
>
>
>
>
>
>   Think before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>
>  On Mar 20, 2015, at 4:13 PM, Robert LeBlanc  wrote:
>
>  Does it show DNE in the entry? That stands for Does Not Exist. It will
> disappear on it's own after a while. I don't know what the timeout is, but
> they have always gone away within 24 hours. I've edited the CRUSH map
> before and I don't think it removed it when it was already DNE, I just had
> to wait for it to go away on it's own.
>
> On Fri, Mar 20, 2015 at 3:55 PM, Jesus Chavez (jeschave) <
> jesch...@cisco.com> wrote:
>
>>  Maybe I should Edit the crushmap and delete osd... Is that a way yo
>> force them?
>>
>>  Thanks
>>
>>
>> * Jesus Chavez*
>> SYSTEMS ENGINEER-C.SALES
>>
>> jesch...@cisco.com
>> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
>> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>>
>> CCIE - 44433
>>
>> On Mar 20, 2015, at 2:21 PM, Robert LeBlanc  wrote:
>>
>>   Removing the OSD from the CRUSH map and deleting the auth key is how
>> you force remove an OSD. The OSD can no longer participate in the cluster,
>> even if it does come back to life. All clients forget about the OSD when
>> the new CRUSH map is distributed.
>>
>> On Fri, Mar 20, 2015 at 11:19 AM, Jesus Chavez (jeschave) <
>> jesch...@cisco.com> wrote:
>>
>>>  Any idea how to forcé remove ? Thanks
>>>
>>>
>>> * Jesus Chavez*
>>> SYSTEMS ENGINEER-C.SALES
>>>
>>> jesch...@cisco.com
>>> Phone: *+52 55 5267 3146 <+52%2055%205267%203146>*
>>> Mobile: *+51 1 5538883255 <+51%201%205538883255>*
>>>
>>> CCIE - 44433
>>>
>>> Begin forwarded message:
>>>
>>>  *From:* Stéphane DUGRAVOT 
>>> *Date:* March 20, 2015 at 3:49:11 AM CST
>>> *To:* "Jesus Chavez (jeschave)" 
>>> *Cc:* ceph-users 
>>> *Subject:* *Re: [ceph-users] OSD Forece Removal*
>>>
>>>
>>>
>>>  --
>>>
>>> Hi all, can anybody tell me how can I force delete  osds? the thing is
>>> that one node got corrupted because of outage, so there is no way to get
>>> those osd up and back, is there anyway to force the removal from
>>> ceph-deploy node?
>>>
>>>
>>>  Hi,
>>>  Try manual :
>>>
>>>-
>>>
>>> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
>>>
>>>
>>>  Thanks
>>>
>>>
>>> * Jesus Chavez*
>>> SYSTEMS ENGINEER-C.SALES
>>>
>>> jesch...@cisco.com
>>> Phone: *+52 55 5267 3146 <%2B52%2055%205267%203146>*
>>> Mobile: *+51 1 5538883255*
>>>
>>> CCIE - 44433
>>>
>>>
>>> Cisco.com <http://www.cisco.com/>
>>>
>>>
>>>
>>>
>>>
>>>   Think before you print.
>>>
>>> This email may contain confidential and privileged material for the sole
>>> use of the intended recipient. Any review, use, distribution or disclosure
>>> by others is strictly prohibited. If you are not the intended recipient (or
>>> authorized to receive for the recipient), please contact the sender by
>>> reply email and delete all copies of this message.
>>>
>>> Please click here
>>> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
>>> Company Registration Information.
>>>
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing a failed OSD disk drive (or replace XFS with BTRFS)

2015-03-21 Thread Robert LeBlanc
When you reformat the drive, it generates a new UUID, so to Ceph it is as
if it were a brand new drive. This does seem heavy-handed, but Ceph was
designed for things to fail and it is not unusual to do things this way.
Ceph is not RAID, so you usually have to do some unthinking.

You could probably keep the UUID and the auth key between reformats, but in
my experience it is so easy to just have Ceph regenerate them that it's not
worth the hassle of trying to keep track of it all.

In our testing we formatted the cluster over a dozen times without losing
data. Because there wasn't much data on it, we were able to reformat 40 OSDs
in under 30 minutes (we formatted a whole host at a time because we knew
that was safe) with a few little online scripts.

Short answer: don't be afraid to do it this way.
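
A rough per-OSD outline of the regenerate-from-scratch approach (osd.3 on
/dev/sdf is only an example, the service command depends on your init
system, and the new OSD will typically be handed back the just-freed id):

ceph osd out 3
# wait for recovery, then stop the daemon and drop the old identity
service ceph stop osd.3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm 3
# replace/wipe the drive and let ceph-disk build a fresh OSD on it
ceph-disk zap /dev/sdf
ceph-disk prepare --fs-type btrfs /dev/sdf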

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 21, 2015 5:11 AM, "Datatone Lists"  wrote:

> I have been experimenting with Ceph, and have some OSDs with drives
> containing XFS filesystems which I want to change to BTRFS.
> (I started with BTRFS, then started again from scratch with XFS
> [currently recommended] in order to eleminate that as a potential cause
> of some issues, now with further experience, I want to go back to
> BTRFS, but have data in my cluster and I don't want to scrap it).
>
> This is exactly equivalent to the case in which I have an OSD with a
> drive that I see is starting to error. I would in that case need to
> replace the drive and recreate the Ceph structures on it.
>
> So, I mark the OSD out, and the cluster automatically eliminates its
> notion of data stored on the OSD and creates copies of the affected PGs
> elsewhere to make the cluster healthy again.
>
> All of the disk replacement instructions that I see then tell me to
> then follow an OSD removal process:
>
> "This procedure removes an OSD from a cluster map, removes its
> authentication key, removes the OSD from the OSD map, and removes the
> OSD from the ceph.conf file".
>
> This seems to me to be too heavy-handed. I'm worried about doing this
> and then effectively adding a new OSD where I have the same id number
> as the OSD that I apparently unnecessarily removed.
>
> I don't actually want to remove the OSD. The OSD is fine, I just want
> to replace the disk drive that it uses.
>
> This suggests that I really want to take the OSD out, allow the cluster
> to get healthy again, then (replace the disk if this is due to
> failure,) create a new BTRFS/XFS filesystem, remount the drive, then
> recreate the Ceph structures on the disk to be compatible with the old
> disk and the original OSD that it was attached to.
>
> The OSD then gets marked back in, the cluster says "hello again, we
> missed you, but its good to see you back, here are some PGs ...".
>
> What I'm saying is that I really don't want to destroy the OSD, I want
> to refresh it with a new disk/filesystem and put it back to work.
>
> Is there some fundamental reason why this can't be done? If not, how
> should I do it?
>
> Best regards,
> David
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD's in a Each node with replica 2

2015-03-23 Thread Robert LeBlanc
I don't have a fresh cluster on hand to double-check, but the default is to
select a different host for each replica. You can adjust that to fit your
needs; we are using cabinet as the selection criterion so that we can lose
an entire cabinet of storage and still function.

In order to allow multiple replicas on the same node, you would need to
change this from host to osd.

Please see http://ceph.com/docs/master/rados/operations/crush-map/
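
For example, the relevant line in a typical replicated rule is

step chooseleaf firstn 0 type host

which places each replica under a different host; changing host to osd
allows replicas to land on the same node, and we put our cabinet-level
bucket type there instead.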

On Tue, Mar 3, 2015 at 7:39 PM, Azad Aliyar 
wrote:

>
> I  have a doubt . In a scenario (3nodes x 4osd each x 2replica)  I tested
> with a node down and as long as you have space available all objects were
> there.
>
> Is it possible all replicas of an object to be saved in the same node?
>
> Is it possible to lose any?
>
> Is there a mechanism that prevents replicas to be stored in another osd in
> the same node?
>
> I would love someone to answer it and any information is highly
> appreciated.
> --
>Warm Regards,  Azad Aliyar
>  Linux Server Engineer
>  *Email* :  azad.ali...@sparksupport.com   *|*   *Skype* :   spark.azad
>  
> 
> 
> 3rd Floor, Leela Infopark, Phase
> -2,Kakanad, Kochi-30, Kerala, India  *Phone*:+91 484 6561696 , 
> *Mobile*:91-8129270421.
>   *Confidentiality Notice:* Information in this e-mail is proprietary to
> SparkSupport. and is intended for use only by the addressed, and may
> contain information that is privileged, confidential or exempt from
> disclosure. If you are not the intended recipient, you are notified that
> any use of this information in any manner is strictly prohibited. Please
> delete this mail & notify us immediately at i...@sparksupport.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRUSH decompile failes

2015-03-23 Thread Robert LeBlanc
I was trying to decompile and edit the CRUSH map to adjust the CRUSH
rules. My first attempt produced a map that would decompile, but I
could not recompile it even if I didn't modify it. Now, after downloading
the map fresh, even the decompile fails.

[root@nodezz ~]# ceph osd getmap -o map.crush
got osdmap epoch 12792
[root@nodezz ~]# crushtool -d map.crush -o map
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: bad magic number
*** Caught signal (Aborted) **
 in thread 7f889ed24780
 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
 in thread 7f889ed24780

 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- begin dump of recent events ---
   -14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
register_command perfcounters_dump hook 0x322be00
   -13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
register_command 1 hook 0x322be00
   -12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
register_command perf dump hook 0x322be00
   -11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
register_command perfcounters_schema hook 0x322be00
   -10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
register_command 2 hook 0x322be00
-9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
register_command perf schema hook 0x322be00
-8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
register_command perf reset hook 0x322be00
-7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
register_command config show hook 0x322be00
-6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
register_command config set hook 0x322be00
-5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
register_command config get hook 0x322be00
-4> 2015-03-23 12:46:34.633672 7f889ed24780  5 asok(0x3229cc0)
register_command config diff hook 0x322be00
-3> 2015-03-23 12:46:34.633685 7f889ed24780  5 asok(0x3229cc0)
register_command log flush hook 0x322be00
-2> 2015-03-23 12:46:34.633698 7f889ed24780  5 asok(0x3229cc0)
register_command log dump hook 0x322be00
-1> 2015-03-23 12:46:34.633711 7f889ed24780  5 asok(0x3229cc0)
register_command log reopen hook 0x322be00
 0> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal
(Aborted) **
 in thread 7f889ed24780

 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent   500
  max_new 1000
  log_file
---

Re: [ceph-users] CRUSH decompile failes

2015-03-23 Thread Robert LeBlanc
Ok, so the decompile error is because I didn't download the CRUSH map
(found that out using hexdump), but I still can't compile an
unmodified CRUSH map.

[root@nodezz ~]# crushtool -d map.crush -o map
[root@nodezz ~]# crushtool -c map -o map.crush
map:105 error: parse error at ''

For some reason it doesn't like the rack definition. I can move things
around, like putting root before it, and it always chokes on the first
rack definition no matter which rack that happens to be.
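
For the record, what I pulled above with 'ceph osd getmap' was the full OSD
map; the CRUSH map itself comes from either of these (file names arbitrary):

osdmaptool map.crush --export-crush crush.raw
ceph osd getcrushmap -o crush.raw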

On Mon, Mar 23, 2015 at 12:53 PM, Robert LeBlanc  wrote:
> I was trying to decompile and edit the CRUSH map to adjust the CRUSH
> rules. My first attempt created a map that would decompile, but I
> could not recompile the CRUSH even if didn't modify it. When trying to
> download the CRUSH fresh, now the decompile fails.
>
> [root@nodezz ~]# ceph osd getmap -o map.crush
> got osdmap epoch 12792
> [root@nodezz ~]# crushtool -d map.crush -o map
> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>   what():  buffer::malformed_input: bad magic number
> *** Caught signal (Aborted) **
>  in thread 7f889ed24780
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>  10: (main()+0x1e0e) [0x4ead4e]
>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>  12: crushtool() [0x4ee5a9]
> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
>  in thread 7f889ed24780
>
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>  10: (main()+0x1e0e) [0x4ead4e]
>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>  12: crushtool() [0x4ee5a9]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- begin dump of recent events ---
>-14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
> register_command perfcounters_dump hook 0x322be00
>-13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
> register_command 1 hook 0x322be00
>-12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
> register_command perf dump hook 0x322be00
>-11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
> register_command perfcounters_schema hook 0x322be00
>-10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
> register_command 2 hook 0x322be00
> -9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
> register_command perf schema hook 0x322be00
> -8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
> register_command perf reset hook 0x322be00
> -7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
> register_command config show hook 0x322be00
> -6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
> register_command config set hook 0x322be00
> -5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
> register_command config get hook 0x322be00
> -4> 2015-03-23 12:46:34.633672 7f889ed24780  5 asok(0x3229cc0)
> register_command config diff hook 0x322be00
> -3> 2015-03-23 12:46:34.633685 7f889ed24780  5 asok(0x3229cc0)
> register_command log flush hook 0x322be00
> -2> 2015-03-23 12:46:34.633698 7f889ed24780  5 asok(0x3229cc0)
> register_command log dump hook 0x322be00
> -1> 2015-03-23 12:46:34.633711 7f889ed24780  5 asok(0x3229cc0)
> register_command log reopen hook 0x322be00
>  0> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal
> (Aborted) **
>  in thread 7f889ed24780
>
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::itera

Re: [ceph-users] CRUSH decompile failes

2015-03-23 Thread Robert LeBlanc
OK, sorry for all the quick e-mails, but I got it to compile. For some
reason there are a few errors in the decompiled CRUSH map.

1. The decompiled map has "alg straw2", which is not valid; removing
the 2 lets it compile.
2. The hosts have weight 0.000, which I don't think prevents the map
from compiling, but will cause other issues.

I created the rack entries on the command line and moved the host
buckets to the racks, then exported the CRUSH to modify the rules.

ceph osd crush add-bucket racka rack
ceph osd crush add-bucket rackb rack
ceph osd crush move nodev rack=racka
ceph osd crush move nodew rack=racka
ceph osd crush move nodex rack=rackb
ceph osd crush move nodey rack=rackb
ceph osd crush move racka root=default
ceph osd crush move rackb root=default
ceph osd crush add-bucket ssd-racka rack
ceph osd crush add-bucket ssd-rackb rack
ceph osd crush move ssd-racka root=ssd
ceph osd crush move ssd-rackb root=ssd
ceph osd crush move nodev-ssd rack=ssd-racka
ceph osd crush move nodew-ssd rack=ssd-racka
ceph osd crush move nodex-ssd rack=ssd-rackb
ceph osd crush move nodey-ssd rack=ssd-rackb

Just saw the e-mail from Sage saying that is all fixed after .93
(which we are on). Saving for posterity's sake. Thanks Sage!

On Mon, Mar 23, 2015 at 1:09 PM, Robert LeBlanc  wrote:
> Ok, so the decompile error is because I didn't download the CRUSH map
> (found that out using hexdump), but I still can't compile an
> unmodified CRUSH map.
>
> [root@nodezz ~]# crushtool -d map.crush -o map
> [root@nodezz ~]# crushtool -c map -o map.crush
> map:105 error: parse error at ''
>
> For some reason it doesn't like the rack definition. I can move things
> around, like putting root before it and it always chokes on the first
> rack definition no matter which one it is.
>
> On Mon, Mar 23, 2015 at 12:53 PM, Robert LeBlanc  wrote:
>> I was trying to decompile and edit the CRUSH map to adjust the CRUSH
>> rules. My first attempt created a map that would decompile, but I
>> could not recompile the CRUSH even if didn't modify it. When trying to
>> download the CRUSH fresh, now the decompile fails.
>>
>> [root@nodezz ~]# ceph osd getmap -o map.crush
>> got osdmap epoch 12792
>> [root@nodezz ~]# crushtool -d map.crush -o map
>> terminate called after throwing an instance of 
>> 'ceph::buffer::malformed_input'
>>   what():  buffer::malformed_input: bad magic number
>> *** Caught signal (Aborted) **
>>  in thread 7f889ed24780
>>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>>  1: crushtool() [0x4f4542]
>>  2: (()+0xf130) [0x7f889df97130]
>>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>>  4: (abort()+0x148) [0x7f889cfd1cd8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>>  6: (()+0x5e946) [0x7f889d8d2946]
>>  7: (()+0x5e973) [0x7f889d8d2973]
>>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>>  10: (main()+0x1e0e) [0x4ead4e]
>>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>>  12: crushtool() [0x4ee5a9]
>> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
>>  in thread 7f889ed24780
>>
>>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>>  1: crushtool() [0x4f4542]
>>  2: (()+0xf130) [0x7f889df97130]
>>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>>  4: (abort()+0x148) [0x7f889cfd1cd8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>>  6: (()+0x5e946) [0x7f889d8d2946]
>>  7: (()+0x5e973) [0x7f889d8d2973]
>>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>>  10: (main()+0x1e0e) [0x4ead4e]
>>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>>  12: crushtool() [0x4ee5a9]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- begin dump of recent events ---
>>-14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
>> register_command perfcounters_dump hook 0x322be00
>>-13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
>> register_command 1 hook 0x322be00
>>-12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
>> register_command perf dump hook 0x322be00
>>-11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
>> register_command perfcounters_schema hook 0x322be00
>>-10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
>> register_command 2 hook 0x322be00
>> -9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
>> register_

Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Robert LeBlanc
You just need to change your rule from

step chooseleaf firstn 0 type osd

to

step chooseleaf firstn 0 type host

There will be data movement as it will want to move about half the
objects to the new host. There will be data generation as you move
from size 1 to size 2. As far as I know a deep scrub won't happen
until the next scheduled time. The time to do all of this is dependent
on your disk speed, network speed, CPU and RAM capacity as well as the
number of backfill processes configured, the priority of the backfill
process, how active your disks are and how much data you have stored
in the cluster. In short ... it depends.
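
If you want to limit the impact while it rebalances, something like this
(conservative example values) can be injected at runtime:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'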

On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
 wrote:
> Hi all!
>
> I had a CEPH Cluster with 10x OSDs all of them in one node.
>
> Since the cluster was built from the beginning with just one OSDs node the
> crushmap had as a default
> the replication to be on OSDs.
>
> Here is the relevant part from my crushmap:
>
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type osd
> step emit
> }
>
> # end crush map
>
>
> I have added a new node with 10x more identical OSDs thus the total OSDs
> nodes are now two.
>
> I have changed the replication factor to be 2 on all pools and I would like
> to make sure that
> I always keep each copy on a different node.
>
> In order to do so do I have to change the CRUSH map?
>
> Which part should I change?
>
>
> After modifying the CRUSH map what procedure will take place before the
> cluster is ready again?
>
> Is it going to start re-balancing and moving data around? Will a deep-scrub
> follow?
>
> Does the time of the procedure depends on anything else except the amount of
> data and the available connection (bandwidth)?
>
>
> Looking forward for your answers!
>
>
> All the best,
>
>
> George
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Robert LeBlanc
I don't believe that you can set the schedule of the deep scrubs.
People who want that kind of control disable deep scrubs and run a
script that deep-scrubs all PGs themselves. For the other options, look
through http://ceph.com/docs/master/rados/configuration/osd-config-ref/
and find what you feel is most important to you. We mess with
"osd max backfills"; you may also want to look at "osd recovery max
active" and "osd recovery op priority", to name a few. You can also adjust
the load threshold below which the cluster is considered idle enough to
run scrubs, etc.
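
A minimal sketch of the script approach (this assumes the first column of
'ceph pg dump pgs_brief' is the PG id, that the nodeep-scrub flag only
stops the scheduled scrubs and not manually requested ones, and the sleep
is just there to spread the load):

ceph osd set nodeep-scrub
for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
    ceph pg deep-scrub "$pg"
    sleep 60
done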

On Mon, Mar 23, 2015 at 5:10 PM, Dimitrakakis Georgios
 wrote:
> Robert thanks for the info!
>
> How can I find out and modify when is scheduled the next deep scrub,
> the number of backfill processes and their priority?
>
> Best regards,
>
> George
>
>
>
>  Ο χρήστης Robert LeBlanc έγραψε 
>
>
> You just need to change your rule from
>
> step chooseleaf firstn 0 type osd
>
> to
>
> step chooseleaf firstn 0 type host
>
> There will be data movement as it will want to move about half the
> objects to the new host. There will be data generation as you move
> from size 1 to size 2. As far as I know a deep scrub won't happen
> until the next scheduled time. The time to do all of this is dependent
> on your disk speed, network speed, CPU and RAM capacity as well as the
> number of backfill processes configured, the priority of the backfill
> process, how active your disks are and how much data you have stored
> in the cluster. In short ... it depends.
>
> On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
>  wrote:
>> Hi all!
>>
>> I had a CEPH Cluster with 10x OSDs all of them in one node.
>>
>> Since the cluster was built from the beginning with just one OSDs node the
>> crushmap had as a default
>> the replication to be on OSDs.
>>
>> Here is the relevant part from my crushmap:
>>
>>
>> # rules
>> rule replicated_ruleset {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type osd
>> step emit
>> }
>>
>> # end crush map
>>
>>
>> I have added a new node with 10x more identical OSDs thus the total OSDs
>> nodes are now two.
>>
>> I have changed the replication factor to be 2 on all pools and I would
>> like
>> to make sure that
>> I always keep each copy on a different node.
>>
>> In order to do so do I have to change the CRUSH map?
>>
>> Which part should I change?
>>
>>
>> After modifying the CRUSH map what procedure will take place before the
>> cluster is ready again?
>>
>> Is it going to start re-balancing and moving data around? Will a
>> deep-scrub
>> follow?
>>
>> Does the time of the procedure depends on anything else except the amount
>> of
>> data and the available connection (bandwidth)?
>>
>>
>> Looking forward for your answers!
>>
>>
>> All the best,
>>
>>
>> George
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Does crushtool --test --simulate do what cluster should do?

2015-03-23 Thread Robert LeBlanc
I'm trying to create a CRUSH ruleset and I'm using crushtool to test
the rules, but it doesn't seem to map things correctly. I have two
roots, one for spindles and another for SSDs. I have two rules, one for
each root. The output of crushtool on rule 0 shows objects being
mapped to SSD OSDs when it should only be choosing spindles.

I'm pretty sure I'm doing something wrong. I've tested the map on .93 and .80.8.

The map is at http://pastebin.com/BjmuASX0

when running

crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate --show-mappings

I'm getting mappings to OSDs > 39, which are SSDs. The same happens when
I run the SSD rule: I get OSDs from both roots. It is as if crushtool
is not selecting the correct root. In fact, both rules result in the
same mapping:

RNG rule 0 x 0 [0,38,23]
RNG rule 0 x 1 [10,25,1]
RNG rule 0 x 2 [11,40,0]
RNG rule 0 x 3 [5,30,26]
RNG rule 0 x 4 [44,30,10]
RNG rule 0 x 5 [8,26,16]
RNG rule 0 x 6 [24,5,36]
RNG rule 0 x 7 [38,10,9]
RNG rule 0 x 8 [39,9,23]
RNG rule 0 x 9 [12,3,24]
RNG rule 0 x 10 [18,6,41]
...

RNG rule 1 x 0 [0,38,23]
RNG rule 1 x 1 [10,25,1]
RNG rule 1 x 2 [11,40,0]
RNG rule 1 x 3 [5,30,26]
RNG rule 1 x 4 [44,30,10]
RNG rule 1 x 5 [8,26,16]
RNG rule 1 x 6 [24,5,36]
RNG rule 1 x 7 [38,10,9]
RNG rule 1 x 8 [39,9,23]
RNG rule 1 x 9 [12,3,24]
RNG rule 1 x 10 [18,6,41]
...


Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write IO Problem

2015-03-24 Thread Robert LeBlanc
I cannot reproduce the snapshot issue with BTRFS on the 3.17 kernel. My
test cluster has had 48 OSDs on BTRFS for four months without an issue
since going to 3.17. The only concern I have is potential slowness over
time. We are not using compression. We are going into production in one
month, and although we haven't had show-stopping issues with BTRFS, we are
still going to start on XFS. Our plan is to build a cluster as a target for
our backup system and put BTRFS on that to prove it in a production
setting.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 24, 2015 7:00 AM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Tue, 24 Mar 2015 07:43:00 + Rottmann Jonas wrote:
>
> > Hi,
> >
> > First of all, thank you for your detailed answer.
> >
> > My Ceph Version is Hammer, sry should have mentioned that.
> >
> > Yes we have 2 Intel 320 for the OS, the think process behind this was
> > that the OS Disk is not that important, and they were cheap but
> > SSDs(power consumption).
> >
> Fair enough.
>
> > The Plan was to put the cluster into production if it works well, but
> > that way we will have to replace the SSDs.
> >
> Those are definitely not durable enough for anything resembling normal
> Ceph usage, never mind what other issues they may have in terms of speed.
>
> Lets assume that your initial fio with 8KB blocks on the actual SSD means
> that it can do about 1 4KB write IOPS. With the journal on that same
> SSD that means you really only have 5000 IOPS left.
> 8 OSDs would make 4 IOPS (ignoring all the other overhead and
> latency), replication of 3 means only 13400 IOPS in the end.
>
> > We chose BTRFS because it is stable, and you can often read it is more
> > performant than XFS. (What speaks against BTRFS besides fragmentation
> > which is irrelevant for SSDs?)
> >
> There are many BTRFS threads in the ML archives, off the top of my head
> regressions in certain kernels that affect stability in general come to
> mind. And snapshots in particular with pretty much any kernel that was
> discussed.
>
> > I'm sry that I used different benchmarks, but they all were far far away
> > from that what I would expect.
> >
> Again, have a look at the various SSD threads, what _did_ you expect?
>
> But what your cluster is _actually_ capable of in the worst case is best
> seen with "rados bench", no caching, just raw ability.
>
> Any improvements over that by RBD cache are just the icing on the cake,
> don't take them for granted.
>
> When doing the various benchmarks, keep an eye on all your storage nodes
> with atop or the likes.
>
> I wouldn't surprise me (as Alexandre suggested) that you're running out of
> CPU power as well.
>
> The best I got out of a single node, 8 SSD OSD "cluster" was about 4500
> IOPS (4KB) with "rados bench" and the machine was totally CPU bound at
> that point (Firefly).
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> > I could narrow it down to the problem the SSDs performing very bad with
> > direct and dsync.
> >
> > What I don't understand how can it be that the internal benchmark gives
> > good values, when each osd only give ~200IOPS with direct,dsync?
> >
> And once more, the fio with 512KB blocks very much matches the "rbd
> bench-write" results you saw when adjusting for the different block sizes.
>
> Christian
> >
> >
> > Hello,
> >
> > If you had used "performance" or "slow" in your subject future
> > generations would be able find this thread and what it is about more
> > easily. ^_-
> >
> > Also, check the various "SSD" + "performance" threads in the ML archives.
> >
> > On Fri, 20 Mar 2015 14:13:19 + Rottmann Jonas wrote:
> >
> > > Hi,
> > >
> > > We have a huge write IO Problem in our preproductive Ceph Cluster.
> > > First our Hardware:
> > >
> > You're not telling us your Ceph version, but from the tunables below I
> > suppose it is Firefly? If you have the time, it would definitely be
> > advisable to wait for Hammer with an all SSD cluster.
> >
> > > 4 OSD Nodes with:
> > >
> > > Supermicro X10 Board
> > > 32GB DDR4 RAM
> > > 2x Intel Xeon E5-2620
> > > LSI SAS 9300-8i Host Bus Adapter
> > > Intel Corporation 82599EB 10-Gigabit
> > > 2x Intel SSDSA2CT040G3 in software raid 1 for system
> > >
> > Nobody really knows what those inane Intel pro

Re: [ceph-users] error creating image in rbd-erasure-pool

2015-03-24 Thread Robert LeBlanc
Is there an enumerated list of issues with snapshots on cache pools?
We currently have snapshots on a cache tier and haven't seen any
issues (development cluster). I just want to know what we should be
looking for.

On Tue, Mar 24, 2015 at 9:21 AM, Stéphane DUGRAVOT
 wrote:
>
>
> 
>
> Hi Markus,
>
> On 24/03/2015 14:47, Markus Goldberg wrote:
>> Hi,
>> this is ceph version 0,93
>> I can't create an image in an rbd-erasure-pool:
>>
>> root@bd-0:~#
>> root@bd-0:~# ceph osd pool create bs3.rep 4096 4096 replicated
>> pool 'bs3.rep' created
>> root@bd-0:~# rbd create --size 1000 --pool bs3.rep test
>> root@bd-0:~#
>> root@bd-0:~# ceph osd pool create bs3.era 4096 4096 erasure
>> pool 'bs3.era' created
>> root@bd-0:~# rbd create --size 1000 --pool bs3.era tapes2
>> rbd: create error: (95) Operation not supported2015-03-24 13:57:31.018411
>> 7fc186b77840 -1
>> librbd: error adding image to directory: (95) Operation not supported
>
> RBD won't work with erasure coded pools. Instead you could try adding a
> replicated cache pool and use it.
>
> See http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ for
> more information.
>
>
> Hi Loic and Markus,
> By the way, Inktank do not support snapshot of a pool with cache tiering :
>
> https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf
>
> What's wrong exactly with that ? i suppose that some features is not
> possible ?
> Do you know what ?
> Thanks,
> Stephane.
>
>
> Cheers
>
>>
>> Is this not possible at the moment or am i mistyping?
>>
>> BTW: Deleting or shrinking an empty image takes very, very lonng
>>
>> Thank you,
>>   Markus
>>
>> --
>> Markus Goldberg   Universität Hildesheim
>>   Rechenzentrum
>> Tel +49 5121 88392822 Universitätsplatz 1, D-31141 Hildesheim, Germany
>> Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
>> --
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does crushtool --test --simulate do what cluster should do?

2015-03-24 Thread Robert LeBlanc
I'm not sure why crushtool --test --simulate doesn't match what the
cluster actually does, but the cluster seems to be executing the rules
even though crushtool doesn't. Just kind of stinks that you have to
test the rules on actual data.

Should I create a ticket for this?

On Mon, Mar 23, 2015 at 6:08 PM, Robert LeBlanc  wrote:
> I'm trying to create a CRUSH ruleset and I'm using crushtool to test
> the rules, but it doesn't seem to mapping things correctly. I have two
> roots, on for spindles and another for SSD. I have two rules, one for
> each root. The output of crushtool on rule 0 shows objects being
> mapped to SSD OSDs when it should only be choosing spindles.
>
> I'm pretty sure I'm doing something wrong. I've tested the map on .93 and 
> .80.8.
>
> The map is at http://pastebin.com/BjmuASX0
>
> when running
>
> crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate --show-mappings
>
> I'm getting mapping to OSDs > 39 which are SSDs. The same happens when
> I run the SSD rule, I get OSDs from both roots. It is as if crushtool
> is not selecting the correct root. In fact both rules result in the
> same mapping:
>
> RNG rule 0 x 0 [0,38,23]
> RNG rule 0 x 1 [10,25,1]
> RNG rule 0 x 2 [11,40,0]
> RNG rule 0 x 3 [5,30,26]
> RNG rule 0 x 4 [44,30,10]
> RNG rule 0 x 5 [8,26,16]
> RNG rule 0 x 6 [24,5,36]
> RNG rule 0 x 7 [38,10,9]
> RNG rule 0 x 8 [39,9,23]
> RNG rule 0 x 9 [12,3,24]
> RNG rule 0 x 10 [18,6,41]
> ...
>
> RNG rule 1 x 0 [0,38,23]
> RNG rule 1 x 1 [10,25,1]
> RNG rule 1 x 2 [11,40,0]
> RNG rule 1 x 3 [5,30,26]
> RNG rule 1 x 4 [44,30,10]
> RNG rule 1 x 5 [8,26,16]
> RNG rule 1 x 6 [24,5,36]
> RNG rule 1 x 7 [38,10,9]
> RNG rule 1 x 8 [39,9,23]
> RNG rule 1 x 9 [12,3,24]
> RNG rule 1 x 10 [18,6,41]
> ...
>
>
> Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does crushtool --test --simulate do what cluster should do?

2015-03-24 Thread Robert LeBlanc
http://tracker.ceph.com/issues/11224

On Tue, Mar 24, 2015 at 12:11 PM, Gregory Farnum  wrote:
> On Tue, Mar 24, 2015 at 10:48 AM, Robert LeBlanc  wrote:
>> I'm not sure why crushtool --test --simulate doesn't match what the
>> cluster actually does, but the cluster seems to be executing the rules
>> even though crushtool doesn't. Just kind of stinks that you have to
>> test the rules on actual data.
>>
>> Should I create a ticket for this?
>
> Yes please! I'm not too familiar with the crushtool internals but the
> simulator code hasn't had too many eyeballs so it's hopefully not too
> hard a bug to fix.
>
>>
>> On Mon, Mar 23, 2015 at 6:08 PM, Robert LeBlanc  wrote:
>>> I'm trying to create a CRUSH ruleset and I'm using crushtool to test
>>> the rules, but it doesn't seem to mapping things correctly. I have two
>>> roots, on for spindles and another for SSD. I have two rules, one for
>>> each root. The output of crushtool on rule 0 shows objects being
>>> mapped to SSD OSDs when it should only be choosing spindles.
>>>
>>> I'm pretty sure I'm doing something wrong. I've tested the map on .93 and 
>>> .80.8.
>>>
>>> The map is at http://pastebin.com/BjmuASX0
>>>
>>> when running
>>>
>>> crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate 
>>> --show-mappings
>>>
>>> I'm getting mapping to OSDs > 39 which are SSDs. The same happens when
>>> I run the SSD rule, I get OSDs from both roots. It is as if crushtool
>>> is not selecting the correct root. In fact both rules result in the
>>> same mapping:
>>>
>>> RNG rule 0 x 0 [0,38,23]
>>> RNG rule 0 x 1 [10,25,1]
>>> RNG rule 0 x 2 [11,40,0]
>>> RNG rule 0 x 3 [5,30,26]
>>> RNG rule 0 x 4 [44,30,10]
>>> RNG rule 0 x 5 [8,26,16]
>>> RNG rule 0 x 6 [24,5,36]
>>> RNG rule 0 x 7 [38,10,9]
>>> RNG rule 0 x 8 [39,9,23]
>>> RNG rule 0 x 9 [12,3,24]
>>> RNG rule 0 x 10 [18,6,41]
>>> ...
>>>
>>> RNG rule 1 x 0 [0,38,23]
>>> RNG rule 1 x 1 [10,25,1]
>>> RNG rule 1 x 2 [11,40,0]
>>> RNG rule 1 x 3 [5,30,26]
>>> RNG rule 1 x 4 [44,30,10]
>>> RNG rule 1 x 5 [8,26,16]
>>> RNG rule 1 x 6 [24,5,36]
>>> RNG rule 1 x 7 [38,10,9]
>>> RNG rule 1 x 8 [39,9,23]
>>> RNG rule 1 x 9 [12,3,24]
>>> RNG rule 1 x 10 [18,6,41]
>>> ...
>>>
>>>
>>> Thanks,
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ERROR: missing keyring, cannot use cephx for authentication

2015-03-25 Thread Robert LeBlanc
It doesn't look like your OSD is mounted. What do you have when you run
mount? How did you create your OSDs?
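
Something along these lines should show it (the device is just an example;
adjust to your own data partition):

mount | grep /var/lib/ceph/osd
# if the data partition isn't mounted, re-activating it usually brings the OSD back
ceph-disk activate /dev/sdc1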

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 25, 2015 1:31 AM, "oyym...@gmail.com"  wrote:

> Hi,Jesus
> I encountered a similar problem.
> *1.* I shut down one of the nodes, but none of the OSDs on that node would
> come back up after reboot.
> *2.* Running service ceph restart manually gave the same error message:
> [root@storage4 ~]# /etc/init.d/ceph start
> === osd.15 ===
> 2015-03-23 14:43:32.399811 7fed0fcf4700 -1 monclient(hunting): ERROR:
> missing keyring, cannot use cephx for authentication
> 2015-03-23 14:43:32.399814 7fed0fcf4700 0 librados: osd.15 initialization
> error (2) No such file or directory
> Error connecting to cluster: ObjectNotFound
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.15
> --keyring=/var/lib/ceph/osd/ceph-15/keyring osd crush create-or-move -- 15
> 0.19 host=storage4 root=default'
> ..
> 3.  ll /var/lib/ceph/osd/ceph-15/
> total 0
>
> all files *disappeared* in the /var/lib/ceph/osd/ceph-15/
>
>
>
>
> --
> oyym...@gmail.com
>
>
> *From:* Jesus Chavez (jeschave) 
> *Date:* 2015-03-24 05:09
> *To:* ceph-users 
> *Subject:* [ceph-users] ERROR: missing keyring, cannot use cephx for
> authentication
> Hi all, I did HA failover test shutting down 1 node and I see that only 1
> OSD came up after reboot:
>
>  [root@geminis ceph]# df -h
> Filesystem Size  Used Avail Use% Mounted on
> /dev/mapper/rhel-root   50G  4.5G   46G   9% /
> devtmpfs   126G 0  126G   0% /dev
> tmpfs  126G   80K  126G   1% /dev/shm
> tmpfs  126G  9.9M  126G   1% /run
> tmpfs  126G 0  126G   0% /sys/fs/cgroup
> /dev/sda1  494M  165M  330M  34% /boot
> /dev/mapper/rhel-home   36G   44M   36G   1% /home
> /dev/sdc1  3.7T  134M  3.7T   1% /var/lib/ceph/osd/ceph-14
>
>  If I run service ceph restart I got this error message…
>
>  Stopping Ceph osd.94 on geminis...done
> === osd.94 ===
> 2015-03-23 15:05:41.632505 7fe7b9941700 -1 monclient(hunting): ERROR:
> missing keyring, cannot use cephx for authentication
> 2015-03-23 15:05:41.632508 7fe7b9941700  0 librados: osd.94 initialization
> error (2) No such file or directory
> Error connecting to cluster: ObjectNotFound
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.94
> --keyring=/var/lib/ceph/osd/ceph-94/keyring osd crush create-or-move -- 94
> 0.05 host=geminis root=default
>
>
>  I have ceph.conf and ceph.client.admin.keyring under /etc/ceph:
>
>
>  [root@geminis ceph]# ls /etc/ceph
> ceph.client.admin.keyring  ceph.conf  rbdmap  tmp1OqNFi  tmptQ0a1P
> [root@geminis ceph]#
>
>
>  does anybody know what could be wrong?
>
>  Thanks
>
>
>
>
>
> * Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: *+52 55 5267 3146 <%2B52%2055%205267%203146>*
> Mobile: *+51 1 5538883255*
>
> CCIE - 44433
>
>
> Cisco.com <http://www.cisco.com/>
>
>
>
>
>
>   Think before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New deployment: errors starting OSDs: "invalid (someone else's?) journal"

2015-03-25 Thread Robert LeBlanc
I don't know much about ceph-deploy,  but I know that ceph-disk has
problems "automatically" adding an SSD OSD when there are journals of
other disks already on it. I've had to partition the disk ahead of
time and pass in the partitions to make ceph-disk work.
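
The workaround I've used looks roughly like this (device names and journal
size are just examples):

# carve the journal partitions on the SSD up front
sgdisk --new=1:0:+20G /dev/sdf
sgdisk --new=2:0:+20G /dev/sdf
# then hand ceph-disk the data disk plus a pre-made journal partition
ceph-disk prepare /dev/sdg /dev/sdf1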

Also, unless you are sure that the /dev devices will be deterministically
named the same each time, I'd recommend you not use /dev/sd* for
pointing to your journals. Instead use something that will always be
the same: since Ceph will partition the disks with GPT, you can use
the partuuid to point to the journal partition and it will always be
right. A while back I used this to "fix" my journal links when I did
it wrong. You will want to double-check that it will work right for
you; no warranty and all that jazz...

# convert the /dev/sd* links for journals into stable by-partuuid links

for lnk in $(ls /var/lib/ceph/osd/); do
    OSD=/var/lib/ceph/osd/$lnk
    # short device name the journal symlink currently points at, e.g. sdb2
    DEV=$(basename "$(readlink $OSD/journal)")
    echo $DEV
    # by-partuuid symlink that resolves to that device
    PUUID=$(ls -l /dev/disk/by-partuuid/ | awk -v dev="$DEV" '$NF ~ dev"$" {print $(NF-2)}')
    ln -sf /dev/disk/by-partuuid/$PUUID $OSD/journal
done

On Wed, Mar 25, 2015 at 10:46 AM, Antonio Messina
 wrote:
> Hi all,
>
> I'm trying to install ceph on a 7-nodes preproduction cluster. Each
> node has 24x 4TB SAS disks (2x dell md1400 enclosures) and 6x 800GB
> SSDs (for cache tiering, not journals). I'm using Ubuntu 14.04 and
> ceph-deploy to install the cluster, I've been trying both Firefly and
> Giant and getting the same error. However, the logs I'm reporting are
> relative to the Firefly installation.
>
> The installation seems to go fine until I try to install the last 2
> OSDs (they are SSD disks) of each host. All the OSDs from 0 to 195 are
> UP and IN, but when I try to deploy the next OSD (no matter what host)
> ceph-osd daemon won't start. The error I get is:
>
> 2015-03-25 17:00:17.130937 7fe231312800  0 ceph version 0.80.9
> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-osd, pid
> 20280
> 2015-03-25 17:00:17.133601 7fe231312800 10
> filestore(/var/lib/ceph/osd/ceph-196) dump_stop
> 2015-03-25 17:00:17.133694 7fe231312800  5
> filestore(/var/lib/ceph/osd/ceph-196) basedir
> /var/lib/ceph/osd/ceph-196 journal /var/lib/ceph/osd/ceph-196/journal
> 2015-03-25 17:00:17.133725 7fe231312800 10
> filestore(/var/lib/ceph/osd/ceph-196) mount fsid is
> 8c2fa707-750a-4773-8918-a368367d9cf5
> 2015-03-25 17:00:17.133789 7fe231312800  0
> filestore(/var/lib/ceph/osd/ceph-196) mount detected xfs (libxfs)
> 2015-03-25 17:00:17.133810 7fe231312800  1
> filestore(/var/lib/ceph/osd/ceph-196)  disabling 'filestore replica
> fadvise' due to known issues with fadvise(DONTNEED) on xfs
> 2015-03-25 17:00:17.135882 7fe231312800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features:
> FIEMAP ioctl is supported and appears to work
> 2015-03-25 17:00:17.135892 7fe231312800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features:
> FIEMAP ioctl is disabled via 'filestore fiemap' config option
> 2015-03-25 17:00:17.136318 7fe231312800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
> 2015-03-25 17:00:17.136373 7fe231312800  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_feature:
> extsize is disabled by conf
> 2015-03-25 17:00:17.136640 7fe231312800  5
> filestore(/var/lib/ceph/osd/ceph-196) mount op_seq is 1
> 2015-03-25 17:00:17.137547 7fe231312800 20 filestore (init)dbobjectmap: seq 
> is 1
> 2015-03-25 17:00:17.137560 7fe231312800 10
> filestore(/var/lib/ceph/osd/ceph-196) open_journal at
> /var/lib/ceph/osd/ceph-196/journal
> 2015-03-25 17:00:17.137575 7fe231312800  0
> filestore(/var/lib/ceph/osd/ceph-196) mount: enabling WRITEAHEAD
> journal mode: checkpoint is not enabled
> 2015-03-25 17:00:17.137580 7fe231312800 10
> filestore(/var/lib/ceph/osd/ceph-196) list_collections
> 2015-03-25 17:00:17.137661 7fe231312800 10 journal journal_replay fs op_seq 1
> 2015-03-25 17:00:17.137668 7fe231312800  2 journal open
> /var/lib/ceph/osd/ceph-196/journal fsid
> 8c2fa707-750a-4773-8918-a368367d9cf5 fs_op_seq 1
> 2015-03-25 17:00:17.137670 7fe22b8b1700 20
> filestore(/var/lib/ceph/osd/ceph-196) sync_entry waiting for
> max_interval 5.00
> 2015-03-25 17:00:17.137690 7fe231312800 10 journal _open_block_device:
> ignoring osd journal size. We'll use the entire block device (size:
> 5367661056)
> 2015-03-25 17:00:17.162489 7fe231312800  1 journal _open
> /var/lib/ceph/osd/ceph-196/journal fd 20: 5367660544 bytes, block size
> 4096 bytes, directio = 1, aio = 1
> 2015-03-25 17:00:17.162502 7fe231312800 10 journal read_header
> 2015-03-25 17:00:17.172249 7fe231312800 10 journal header: block_size
> 4096 alignment 4096 max_size 5367660544
> 2015-03-25 17:00:17.172256 7fe231312800 10 journal header: start 50987008
> 2015-03-25 17:00:17.172257 7fe231312800 10 journal  write_pos 4096
> 2015-03-25 17:00:17.172259 7fe231312800 10 journal open header.fsid =
> 942f2d62-dd99-42a8-878a-f

Re: [ceph-users] New deployment: errors starting OSDs: "invalid (someone else's?) journal"

2015-03-25 Thread Robert LeBlanc
Probably a case of trying to read too fast. Sorry about that.

As far as your theory on the cache pool goes, I haven't tried that, but my
gut feeling is that it won't help as much as having the journal on the
SSD. The cache tier doesn't try to coalesce writes the way the
journal does, so on the spindle you end up writing to two
very different parts of the drive for every piece of data; the journal
reduces this somewhat, but I think it will still be
significant. When I see writes coming off my SSD journals to the
spindles, I'm still getting a lot of merged IO (at least during a
backfill/recovery). I'm interested in your results.

As for the foreign journal, I would run dd over the journal
partition and try again. It sounds like something didn't get
cleaned up from a previous run.
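
Something like this is what I had in mind (the partition path is just an
example; double-check it before pointing dd at anything):

# wipe the old journal header so it isn't seen as someone else's journal
dd if=/dev/zero of=/dev/sdX2 bs=1M count=100 oflag=direct
# then let ceph-disk/ceph-deploy recreate it, or run 'ceph-osd -i 196 --mkjournal' by hand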


On Wed, Mar 25, 2015 at 11:14 AM, Antonio Messina
 wrote:
> On Wed, Mar 25, 2015 at 6:06 PM, Robert LeBlanc  wrote:
>> I don't know much about ceph-deploy,  but I know that ceph-disk has
>> problems "automatically" adding an SSD OSD when there are journals of
>> other disks already on it. I've had to partition the disk ahead of
>> time and pass in the partitions to make ceph-disk work.
>
> This is not my case: the journal is created automatically by
> ceph-deploy on the same disk, so that for each disk, /dev/sdX1 is the
> data partition and /dev/sdX2 is the journal partition. This is also
> what I want: I know there is a performance drop, but I expect it to be
> mitigated by the cache tier. (and I plan to test both configuration
> anyway)
>
>> Also, unless you are sure that the dev devices will be deterministicly
>> named the same each time, I'd recommend you not use /dev/sd* for
>> pointing to your journals. Instead use something that will always be
>> the same, since Ceph with partition the disks with GPT, you can use
>> the partuuid to point to the journal partition and it will always be
>> right. A while back I used this to "fix" my journal links when I did
>> it wrong. You will want to double check that it will work right for
>> you. no warranty and all that jazz...
>
> Thank you for pointing this out, it's an important point. However, the
> links are actually created using the partuuid. The command I posted in
> my previous email included the output of a pair of nested "readlink"
> in order to get the /dev/sd* names, because in this way it's easier to
> see if there are duplicates and where :)
>
> The output of "ls -l /var/lib/ceph/osd/ceph-*/journal" is actually:
>
> lrwxrwxrwx 1 root root 58 Mar 25 11:38
> /var/lib/ceph/osd/ceph-0/journal ->
> /dev/disk/by-partuuid/18305316-96b0-4654-aaad-7aeb891429f6
> lrwxrwxrwx 1 root root 58 Mar 25 11:49
> /var/lib/ceph/osd/ceph-7/journal ->
> /dev/disk/by-partuuid/a263b19a-cb0d-4b4c-bd81-314619d5755d
> lrwxrwxrwx 1 root root 58 Mar 25 12:21
> /var/lib/ceph/osd/ceph-14/journal ->
> /dev/disk/by-partuuid/79734e0e-87dd-40c7-ba83-0d49695a75fb
> lrwxrwxrwx 1 root root 58 Mar 25 12:31
> /var/lib/ceph/osd/ceph-21/journal ->
> /dev/disk/by-partuuid/73a504bc-3179-43fd-942c-13c6bd8633c5
> lrwxrwxrwx 1 root root 58 Mar 25 12:42
> /var/lib/ceph/osd/ceph-28/journal ->
> /dev/disk/by-partuuid/ecff10df-d757-4b1f-bef4-88dd84d84ef1
> lrwxrwxrwx 1 root root 58 Mar 25 12:52
> /var/lib/ceph/osd/ceph-35/journal ->
> /dev/disk/by-partuuid/5be30238-3f07-4950-b39f-f5e4c7305e4c
> lrwxrwxrwx 1 root root 58 Mar 25 13:02
> /var/lib/ceph/osd/ceph-42/journal ->
> /dev/disk/by-partuuid/3cdb65f2-474c-47fb-8d07-83e7518418ff
> lrwxrwxrwx 1 root root 58 Mar 25 13:12
> /var/lib/ceph/osd/ceph-49/journal ->
> /dev/disk/by-partuuid/a47fe2b7-e375-4eea-b7a9-0354a24548dc
> lrwxrwxrwx 1 root root 58 Mar 25 13:22
> /var/lib/ceph/osd/ceph-56/journal ->
> /dev/disk/by-partuuid/fb42b7d6-bc6c-4063-8b73-29beb1f65107
> lrwxrwxrwx 1 root root 58 Mar 25 13:33
> /var/lib/ceph/osd/ceph-63/journal ->
> /dev/disk/by-partuuid/72aff32b-ca56-4c25-b8ea-ff3aba8db507
> lrwxrwxrwx 1 root root 58 Mar 25 13:43
> /var/lib/ceph/osd/ceph-70/journal ->
> /dev/disk/by-partuuid/b7c17a75-47cd-401e-b963-afe910612bd6
> lrwxrwxrwx 1 root root 58 Mar 25 13:53
> /var/lib/ceph/osd/ceph-77/journal ->
> /dev/disk/by-partuuid/2c1c2501-fa82-4fc9-a586-03cc4d68faef
> lrwxrwxrwx 1 root root 58 Mar 25 14:03
> /var/lib/ceph/osd/ceph-84/journal ->
> /dev/disk/by-partuuid/46f619a5-3edf-44e9-99a6-24d98bcd174a
> lrwxrwxrwx 1 root root 58 Mar 25 14:13
> /var/lib/ceph/osd/ceph-91/journal ->
> /dev/disk/by-partuuid/5feef832-dd82-4aa0-9264-dc9496a3f93a
> lrwxrwxrwx 1 root root 58 Mar 25 14:24
> /var/lib/ceph/osd/ceph-98/journal ->
> /dev/disk/by-partuu

Re: [ceph-users] Migrating objects from one pool to another?

2015-03-26 Thread Robert LeBlanc
I thought there was some discussion about this before. Something like
creating a new pool, attaching your existing pool as a cache overlay of the
new pool, and then flushing the overlay into the new pool. I haven't
tried it and don't know if it is possible.

The other option is to shut the VM down, snapshot and clone the image into
the new pool, point the VM at the clone, and then flatten the RBD.
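
A very rough sketch of both ideas (pool and image names are just examples,
and I haven't verified the cache-overlay variant myself):

# option 1: make the old pool a forward-mode cache tier of the new pool, then flush it out
ceph osd tier add newpool oldpool --force-nonempty
ceph osd tier cache-mode oldpool forward
ceph osd tier set-overlay newpool oldpool
rados -p oldpool cache-flush-evict-all
ceph osd tier remove-overlay newpool
ceph osd tier remove newpool oldpool

# option 2: snapshot/clone the image into the new pool, then flatten (needs format 2 images)
rbd snap create oldpool/vm-disk@migrate
rbd snap protect oldpool/vm-disk@migrate
rbd clone oldpool/vm-disk@migrate newpool/vm-disk
rbd flatten newpool/vm-disk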

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 26, 2015 5:23 PM, "Steffen W Sørensen"  wrote:

>
> On 26/03/2015, at 23.13, Gregory Farnum  wrote:
>
> The procedure you've outlined won't copy snapshots, just the head
> objects. Preserving the proper snapshot metadata and inter-pool
> relationships on rbd images I think isn't actually possible when
> trying to change pools.
>
> This wasn’t meant for migrating an RBD pool, but pure object/Swift pools…
>
> Anyway seems Glance
> <http://docs.openstack.org/developer/glance/architecture.html#basic-architecture>
>  supports multiple storages
> <http://docs.openstack.org/developer/glance/configuring.html#configuring-multiple-swift-accounts-stores>
>  so
> assume one could use a glance client to also extract/download images into
> local file format (raw, qcow2 vmdk…) as well as uploading images to glance.
> And as glance images ain’t ‘live’ like virtual disk images one could also
> download glance images from one glance store over local file and upload
> back into a different glance back end store. Again this is probably better
> than dealing at a lower abstraction level and having to know its internal
> storage structures, and avoids what you’re pointing out, Greg.
>
>
>
>
>
> On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen  wrote:
>
>
> On 26/03/2015, at 23.01, Gregory Farnum  wrote:
>
> On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen  wrote:
>
>
> On 26/03/2015, at 21.07, J-P Methot  wrote:
>
> That's a great idea. I know I can setup cinder (the openstack volume
> manager) as a multi-backend manager and migrate from one backend to the
> other, each backend linking to different pools of the same ceph cluster.
> What bugs me though is that I'm pretty sure the image store, glance,
> wouldn't let me do that. Additionally, since the compute component also has
> its own ceph pool, I'm pretty sure it won't let me migrate the data through
> openstack.
>
> Hm wouldn’t it be possible to do something similar ala:
>
> # list objects in the src pool and copy each one via a temp file
> rados -p pool-wth-too-many-pgs ls | while read obj; do
>    # export $obj to local disk
>    rados -p pool-wth-too-many-pgs get "$obj" /tmp/obj.tmp
>    # import $obj from local disk to new pool
>    rados -p better-sized-pool put "$obj" /tmp/obj.tmp
> done
>
>
> You would also have issues with snapshots if you do this on an RBD
> pool. That's unfortunately not feasible.
>
> What isn’t possible, export-import objects out-and-in of pools or snapshots
> issues?
>
> /Steffen
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Where is the systemd files?

2015-03-26 Thread Robert LeBlanc
I understand that Giant should have systemd service files, but I don't
see them in the CentOS 7 packages.

https://github.com/ceph/ceph/tree/giant/systemd

[ulhglive-root@mon1 systemd]# rpm -qa | grep --color=always ceph
ceph-common-0.93-0.el7.centos.x86_64
python-cephfs-0.93-0.el7.centos.x86_64
libcephfs1-0.93-0.el7.centos.x86_64
ceph-0.93-0.el7.centos.x86_64
ceph-deploy-1.5.22-0.noarch
[ulhglive-root@mon1 systemd]# for i in $(rpm -qa | grep ceph); do rpm
-ql $i | grep -i --color=always systemd; done
[nothing returned]

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 0.93 fresh cluster won't create PGs

2015-03-27 Thread Robert LeBlanc
in_size=2. We had
puppet add 10 OSDs on one host, and waited, the cluster became healthy
again. We had puppet add another host with 10 OSDs and waited for the
cluster to become healthy again. We had puppet add the 8 remaining
OSDs on the first host and the cluster became healthy again. We set
the CRUSH rule back to host and the cluster became healthy again.

In order to test a theory we decided to kick off puppet on the
remaining 10 hosts with 10 OSDs each at the same time (similar to what
we did before). When about the 97th OSD was added, we started getting
messages in ceph -w about stuck PGs and the cluster never became
healthy.

I wonder if there are too many changes in too short of an amount of
time causing the OSDs to overrun a journal or something (I know that
Ceph journals pgmap changes and such). I'm concerned that this could
be very detrimental in a production environment. There doesn't seem to
be a way to recover from this.

Any thoughts?

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 0.93 fresh cluster won't create PGs

2015-03-27 Thread Robert LeBlanc
Thanks, we'll give the gitbuilder packages a shot and report back.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Mar 27, 2015 10:03 PM, "Sage Weil"  wrote:

> On Fri, 27 Mar 2015, Robert LeBlanc wrote:
> > I've built Ceph clusters a few times now and I'm completely baffled
> > about what we are seeing. We had a majority of the nodes on a new
> > cluster go down yesterday and we got PGs stuck peering. We checked
> > logs, firewalls, file descriptors, etc and nothing is pointing to what
> > the problem is. We thought we could work around the problem by
> > deleting all the pools and recreating them, but still most of the PGs
> > were in a creating+peering state. Rebooting OSDs, reformatting them,
> > adjusting the CRUSH, etc all proved fruitless. I took min_size and
> > size to 1, tried scrubbing, deep-scrubbing the PGs and OSDs. Nothing
> > seems to get the cluster to progress.
> >
> > As a last ditch effort, we wiped the whole cluster, regenerated UUID,
> > keys, etc and pushed it all through puppet again. After creating the
> > OSDs there are PGs stuck. Here is some info:
> >
> > [ulhglive-root@mon1 ~]# ceph status
> > cluster fa158fa8-3e5d-47b1-a7bc-98a41f510ac0
> >  health HEALTH_WARN
> > 1214 pgs peering
> > 1216 pgs stuck inactive
> > 1216 pgs stuck unclean
> >  monmap e2: 3 mons at
> > {mon1=
> 10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> > election epoch 6, quorum 0,1,2 mon1,mon2,mon3
> >  osdmap e161: 130 osds: 130 up, 130 in
> >   pgmap v468: 2048 pgs, 2 pools, 0 bytes data, 0 objects
> > 5514 MB used, 472 TB / 472 TB avail
> >  965 peering
> >  832 active+clean
> >  249 creating+peering
> >2 activating
>
> Usually when we've seen something like this it has been something annoying
> with the environment, like a broken network that causes the tcp streams to
> freeze once they start sending significant traffic (e.g., affecting the
> connections that transport data but not the ones that handle heartbeats).
>
> As you're rebuilding, perhaps the issues start once you hit a particular
> rack or host?
>
> > [ulhglive-root@mon1 ~]# ceph health detail | head -n 15
> > HEALTH_WARN 1214 pgs peering; 1216 pgs stuck inactive; 1216 pgs stuck
> unclean
> > pg 2.17f is stuck inactive since forever, current state
> > creating+peering, last acting [39,42,77]
> > pg 2.17e is stuck inactive since forever, current state
> > creating+peering, last acting [125,3,110]
> > pg 2.179 is stuck inactive since forever, current state peering, last
> acting [0]
> > pg 2.178 is stuck inactive since forever, current state
> > creating+peering, last acting [99,120,54]
> > pg 2.17b is stuck inactive since forever, current state peering, last
> acting [0]
> > pg 2.17a is stuck inactive since forever, current state
> > creating+peering, last acting [91,96,122]
> > pg 2.175 is stuck inactive since forever, current state
> > creating+peering, last acting [55,127,2]
> > pg 2.174 is stuck inactive since forever, current state peering, last
> acting [0]
> > pg 2.176 is stuck inactive since forever, current state
> > creating+peering, last acting [13,70,8]
> > pg 2.172 is stuck inactive since forever, current state peering, last
> acting [0]
> > pg 2.16c is stuck inactive for 1344.369455, current state peering,
> > last acting [99,104,85]
> > pg 2.16e is stuck inactive since forever, current state peering, last
> acting [0]
> > pg 2.169 is stuck inactive since forever, current state
> > creating+peering, last acting [125,24,65]
> > pg 2.16a is stuck inactive since forever, current state peering, last
> acting [0]
> > Traceback (most recent call last):
> >   File "/bin/ceph", line 896, in 
> > retval = main()
> >   File "/bin/ceph", line 883, in main
> > sys.stdout.write(prefix + outbuf + suffix)
> > IOError: [Errno 32] Broken pipe
> > [ulhglive-root@mon1 ~]# ceph pg dump_stuck | head -n 15
> > ok
> > pg_stat state   up  up_primary  acting  acting_primary
> > 2.17f   creating+peering[39,42,77]  39  [39,42,77]
> 39
> > 2.17e   creating+peering[125,3,110] 125 [125,3,110]
>  125
> > 2.179   peering [0] 0   [0] 0
> > 2.178   creating+peering[99,120,54] 99  [99,120,54]
>  99
> > 2.17b   peering [0] 0   [0] 0
> > 2.17a   creat

[ceph-users] Force an OSD to try to peer

2015-03-30 Thread Robert LeBlanc
I've been working at this peering problem all day. I've done a lot of
testing at the network layer and I just don't believe that we have a
problem that would prevent OSDs from peering. When looking through osd_debug
20/20 logs, it just doesn't look like the OSDs are trying to peer. I don't
know if it is because there are so many outstanding creations or what. OSDs
will peer with OSDs on other hosts, but for some reason only choose a certain
number and not the ones they need to finish the peering process.
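
For reference, this is roughly how I've been turning up the logging on a
suspect OSD (the OSD id is just an example):

ceph tell osd.40 injectargs '--debug-osd 20/20 --debug-ms 1'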

I've checked: firewall, open files, number of threads allowed. These usually
have given me an error in the logs that helped me fix the problem.

I can't find a configuration item that specifies how many peers an OSD
should contact or anything that would be artificially limiting the peering
connections. I've restarted the OSDs a number of times, as well as
rebooting the hosts. I believe if the OSDs finish peering everything will
clear up. I can't find anything in pg query that would help me figure out
what is blocking it (peering blocked by is empty). The PGs are scattered
across all the hosts so we can't pin it down to a specific host.

Any ideas on what to try would be appreciated.

[ulhglive-root@ceph9 ~]# ceph --version
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
[ulhglive-root@ceph9 ~]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck
inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20)
 monmap e2: 3 mons at {mon1=
10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0},
election epoch 30, quorum 0,1,2 mon1,mon2,mon3
 osdmap e704: 120 osds: 120 up, 120 in
  pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects
11447 MB used, 436 TB / 436 TB avail
 727 active+clean
 990 peering
  37 creating+peering
   1 down+peering
 290 remapped+peering
   3 creating+remapped+peering

{ "state": "peering",
  "epoch": 707,
  "up": [
40,
92,
48,
91],
  "acting": [
40,
92,
48,
91],
  "info": { "pgid": "7.171",
  "last_update": "0'0",
  "last_complete": "0'0",
  "log_tail": "0'0",
  "last_user_version": 0,
  "last_backfill": "MAX",
  "purged_snaps": "[]",
  "history": { "epoch_created": 293,
  "last_epoch_started": 343,
  "last_epoch_clean": 343,
  "last_epoch_split": 0,
  "same_up_since": 688,
  "same_interval_since": 688,
  "same_primary_since": 608,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_clean_scrub_stamp": "0.00"},
  "stats": { "version": "0'0",
  "reported_seq": "326",
  "reported_epoch": "707",
  "state": "peering",
  "last_fresh": "2015-03-30 20:10:39.509855",
  "last_change": "2015-03-30 19:44:17.361601",
  "last_active": "2015-03-30 11:37:56.956417",
  "last_clean": "2015-03-30 11:37:56.956417",
  "last_became_active": "0.00",
  "last_unstale": "2015-03-30 20:10:39.509855",
  "mapping_epoch": 683,
  "log_start": "0'0",
  "ondisk_log_start": "0'0",
  "created": 293,
  "last_epoch_clean": 343,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_clean_scrub_stamp": "0.00",
  "log_size": 0,
  "ondisk_log_size": 0,
  "stats_invalid": "0",
  "stat_sum": { "num_bytes": 0,
  "num_objects": 0,
  "num_object_clones": 0,
  "num_object_copies": 0,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_objects_dirty": 0,
  "num_whiteouts": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 0,
  "num_write_kb": 0,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 0,
  "num_bytes_recovered": 0,
  "num_keys_recovered": 0,
  "num_objects_omap": 0,
  "num_objects_hit_set_archive": 0},
  "stat_cat_sum": {},
  "up": [
40,
92,
48,
91],
  "acting": [
40,
92,
48,
91],
  "up_primary": 40,
  "acting_primary": 

[ceph-users] Fwd: Force an OSD to try to peer

2015-03-30 Thread Robert LeBlanc
Sorry HTML snuck in somewhere.

-- Forwarded message --
From: Robert LeBlanc 
Date: Mon, Mar 30, 2015 at 8:15 PM
Subject: Force an OSD to try to peer
To: Ceph-User , ceph-devel 


I've been working at this peering problem all day. I've done a lot of
testing at the network layer and I just don't believe that we have a
problem that would prevent OSDs from peering. When looking through
osd_debug 20/20 logs, it just doesn't look like the OSDs are trying to
peer. I don't know if it is because there are so many outstanding
creations or what. OSDs will peer with OSDs on other hosts, but for
some reason only choose a certain number and not the ones they need to
finish the peering process.

I've checked: firewall, open files, number of threads allowed. These
usually have given me an error in the logs that helped me fix the
problem.

I can't find a configuration item that specifies how many peers an OSD
should contact or anything that would be artificially limiting the
peering connections. I've restarted the OSDs a number of times, as
well as rebooting the hosts. I believe if the OSDs finish peering
everything will clear up. I can't find anything in pg query that would
help me figure out what is blocking it (peering blocked by is empty).
The PGs are scattered across all the hosts so we can't pin it down to
a specific host.

Any ideas on what to try would be appreciated.

[ulhglive-root@ceph9 ~]# ceph --version
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
[ulhglive-root@ceph9 ~]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck
inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20)
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0},
election epoch 30, quorum 0,1,2 mon1,mon2,mon3
 osdmap e704: 120 osds: 120 up, 120 in
  pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects
11447 MB used, 436 TB / 436 TB avail
 727 active+clean
 990 peering
  37 creating+peering
   1 down+peering
 290 remapped+peering
   3 creating+remapped+peering

{ "state": "peering",
  "epoch": 707,
  "up": [
40,
92,
48,
91],
  "acting": [
40,
92,
48,
91],
  "info": { "pgid": "7.171",
  "last_update": "0'0",
  "last_complete": "0'0",
  "log_tail": "0'0",
  "last_user_version": 0,
  "last_backfill": "MAX",
  "purged_snaps": "[]",
  "history": { "epoch_created": 293,
  "last_epoch_started": 343,
  "last_epoch_clean": 343,
  "last_epoch_split": 0,
  "same_up_since": 688,
  "same_interval_since": 688,
  "same_primary_since": 608,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_clean_scrub_stamp": "0.00"},
  "stats": { "version": "0'0",
  "reported_seq": "326",
  "reported_epoch": "707",
  "state": "peering",
  "last_fresh": "2015-03-30 20:10:39.509855",
  "last_change": "2015-03-30 19:44:17.361601",
  "last_active": "2015-03-30 11:37:56.956417",
  "last_clean": "2015-03-30 11:37:56.956417",
  "last_became_active": "0.00",
  "last_unstale": "2015-03-30 20:10:39.509855",
  "mapping_epoch": 683,
  "log_start": "0'0",
  "ondisk_log_start": "0'0",
  "created": 293,
  "last_epoch_clean": 343,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
  "last_clean_scrub_stamp": "0.00",
  "log_size": 0,
  "ondisk_log_size": 0,
  "stats_

Re: [ceph-users] Force an OSD to try to peer

2015-03-31 Thread Robert LeBlanc
Turns out jumbo frames were not enabled on all the switch ports. Once that
was resolved, the cluster quickly became healthy.

On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc  wrote:
> I've been working at this peering problem all day. I've done a lot of
> testing at the network layer and I just don't believe that we have a problem
> that would prevent OSDs from peering. When looking though osd_debug 20/20
> logs, it just doesn't look like the OSDs are trying to peer. I don't know if
> it is because there are so many outstanding creations or what. OSDs will
> peer with OSDs on other hosts, but for reason only chooses a certain number
> and not one that it needs to finish the peering process.
>
> I've check: firewall, open files, number of threads allowed. These usually
> have given me an error in the logs that helped me fix the problem.
>
> I can't find a configuration item that specifies how many peers an OSD
> should contact or anything that would be artificially limiting the peering
> connections. I've restarted the OSDs a number of times, as well as rebooting
> the hosts. I beleive if the OSDs finish peering everything will clear up. I
> can't find anything in pg query that would help me figure out what is
> blocking it (peering blocked by is empty). The PGs are scattered across all
> the hosts so we can't pin it down to a specific host.
>
> Any ideas on what to try would be appreciated.
>
> [ulhglive-root@ceph9 ~]# ceph --version
> ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
> [ulhglive-root@ceph9 ~]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck
> inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20)
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0},
> election epoch 30, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e704: 120 osds: 120 up, 120 in
>   pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects
> 11447 MB used, 436 TB / 436 TB avail
>  727 active+clean
>  990 peering
>   37 creating+peering
>1 down+peering
>  290 remapped+peering
>3 creating+remapped+peering
>
> { "state": "peering",
>   "epoch": 707,
>   "up": [
> 40,
> 92,
> 48,
> 91],
>   "acting": [
> 40,
> 92,
> 48,
> 91],
>   "info": { "pgid": "7.171",
>   "last_update": "0'0",
>   "last_complete": "0'0",
>   "log_tail": "0'0",
>   "last_user_version": 0,
>   "last_backfill": "MAX",
>   "purged_snaps": "[]",
>   "history": { "epoch_created": 293,
>   "last_epoch_started": 343,
>   "last_epoch_clean": 343,
>   "last_epoch_split": 0,
>   "same_up_since": 688,
>   "same_interval_since": 688,
>   "same_primary_since": 608,
>   "last_scrub": "0'0",
>   "last_scrub_stamp": "2015-03-30 11:11:18.872851",
>   "last_deep_scrub": "0'0",
>   "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
>   "last_clean_scrub_stamp": "0.00"},
>   "stats": { "version": "0'0",
>   "reported_seq": "326",
>   "reported_epoch": "707",
>   "state": "peering",
>   "last_fresh": "2015-03-30 20:10:39.509855",
>   "last_change": "2015-03-30 19:44:17.361601",
>   "last_active": "2015-03-30 11:37:56.956417",
>   "last_clean": "2015-03-30 11:37:56.956417",
>   "last_became_active": "0.00",
>   "last_unstale": "2015-03-30 20:10:39.509855",
>   "mapping_epoch": 683,
>   "log_start": "0'0",
>   "ondisk_log_start": "0'0",
>   "created": 293,
>   "last_epoch_clean": 343,
>   "parent": "0.0",
>   "parent_split_bits": 0,
>   "las

Re: [ceph-users] Force an OSD to try to peer

2015-03-31 Thread Robert LeBlanc
I was desperate for anything after exhausting every other possibility
I could think of. Maybe I should put a checklist in the Ceph docs of
things to look for.

Thanks,

On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil  wrote:
> On Tue, 31 Mar 2015, Robert LeBlanc wrote:
>> Turns out jumbo frames was not set on all the switch ports. Once that
>> was resolved the cluster quickly became healthy.
>
> I always hesitate to point the finger at the jumbo frames configuration
> but almost every time that is the culprit!
>
> Thanks for the update.  :)
> sage
>
>
>
>>
>> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc  wrote:
>> > I've been working at this peering problem all day. I've done a lot of
>> > testing at the network layer and I just don't believe that we have a 
>> > problem
>> > that would prevent OSDs from peering. When looking though osd_debug 20/20
>> > logs, it just doesn't look like the OSDs are trying to peer. I don't know 
>> > if
>> > it is because there are so many outstanding creations or what. OSDs will
>> > peer with OSDs on other hosts, but for reason only chooses a certain number
>> > and not one that it needs to finish the peering process.
>> >
>> > I've check: firewall, open files, number of threads allowed. These usually
>> > have given me an error in the logs that helped me fix the problem.
>> >
>> > I can't find a configuration item that specifies how many peers an OSD
>> > should contact or anything that would be artificially limiting the peering
>> > connections. I've restarted the OSDs a number of times, as well as 
>> > rebooting
>> > the hosts. I beleive if the OSDs finish peering everything will clear up. I
>> > can't find anything in pg query that would help me figure out what is
>> > blocking it (peering blocked by is empty). The PGs are scattered across all
>> > the hosts so we can't pin it down to a specific host.
>> >
>> > Any ideas on what to try would be appreciated.
>> >
>> > [ulhglive-root@ceph9 ~]# ceph --version
>> > ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>> > [ulhglive-root@ceph9 ~]# ceph status
>> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>> >  health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck
>> > inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20)
>> >  monmap e2: 3 mons at
>> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0},
>> > election epoch 30, quorum 0,1,2 mon1,mon2,mon3
>> >  osdmap e704: 120 osds: 120 up, 120 in
>> >   pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects
>> > 11447 MB used, 436 TB / 436 TB avail
>> >  727 active+clean
>> >  990 peering
>> >   37 creating+peering
>> >1 down+peering
>> >  290 remapped+peering
>> >3 creating+remapped+peering
>> >
>> > { "state": "peering",
>> >   "epoch": 707,
>> >   "up": [
>> > 40,
>> > 92,
>> > 48,
>> > 91],
>> >   "acting": [
>> > 40,
>> > 92,
>> > 48,
>> > 91],
>> >   "info": { "pgid": "7.171",
>> >   "last_update": "0'0",
>> >   "last_complete": "0'0",
>> >   "log_tail": "0'0",
>> >   "last_user_version": 0,
>> >   "last_backfill": "MAX",
>> >   "purged_snaps": "[]",
>> >   "history": { "epoch_created": 293,
>> >   "last_epoch_started": 343,
>> >   "last_epoch_clean": 343,
>> >   "last_epoch_split": 0,
>> >   "same_up_since": 688,
>> >   "same_interval_since": 688,
>> >   "same_primary_since": 608,
>> >   "last_scrub": "0'0",
>> >   "last_scrub_stamp": "2015-03-30 11:11:18.872851",
>> >   "last_deep_scrub": "0'0",
>> >   "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851",
&g

Re: [ceph-users] Force an OSD to try to peer

2015-03-31 Thread Robert LeBlanc
At the L2 level, if the hosts and switches don't accept jumbo frames,
they just drop them because they are too big. They are not fragmented
because they don't go through a router. My problem is that OSDs were
able to peer with other OSDs on the host, but my guess is that they
never sent/received packets larger than 1500 bytes. Then other OSD
processes tried to peer but sent packets larger than 1500 bytes
causing the packets to be dropped and peering to stall.
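
A quick way to catch this kind of mismatch is to force full-size frames end
to end (the host name is a placeholder):

# 8972 = 9000 minus 28 bytes of IP/ICMP headers; this fails fast if the path drops jumbo frames
ping -M do -s 8972 -c 3 <other-osd-host>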

On Tue, Mar 31, 2015 at 12:10 PM, Somnath Roy  wrote:
> But, do we know why Jumbo frames may have an impact on peering ?
> In our setup so far, we haven't enabled jumbo frames other than performance 
> reason (if at all).
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Tuesday, March 31, 2015 11:08 AM
> To: Sage Weil
> Cc: ceph-devel; Ceph-User
> Subject: Re: [ceph-users] Force an OSD to try to peer
>
> I was desperate for anything after exhausting every other possibility I could 
> think of. Maybe I should put a checklist in the Ceph docs of things to look 
> for.
>
> Thanks,
>
> On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil  wrote:
>> On Tue, 31 Mar 2015, Robert LeBlanc wrote:
>>> Turns out jumbo frames was not set on all the switch ports. Once that
>>> was resolved the cluster quickly became healthy.
>>
>> I always hesitate to point the finger at the jumbo frames
>> configuration but almost every time that is the culprit!
>>
>> Thanks for the update.  :)
>> sage
>>
>>
>>
>>>
>>> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc  
>>> wrote:
>>> > I've been working at this peering problem all day. I've done a lot
>>> > of testing at the network layer and I just don't believe that we
>>> > have a problem that would prevent OSDs from peering. When looking
>>> > though osd_debug 20/20 logs, it just doesn't look like the OSDs are
>>> > trying to peer. I don't know if it is because there are so many
>>> > outstanding creations or what. OSDs will peer with OSDs on other
>>> > hosts, but for reason only chooses a certain number and not one that it 
>>> > needs to finish the peering process.
>>> >
>>> > I've check: firewall, open files, number of threads allowed. These
>>> > usually have given me an error in the logs that helped me fix the problem.
>>> >
>>> > I can't find a configuration item that specifies how many peers an
>>> > OSD should contact or anything that would be artificially limiting
>>> > the peering connections. I've restarted the OSDs a number of times,
>>> > as well as rebooting the hosts. I beleive if the OSDs finish
>>> > peering everything will clear up. I can't find anything in pg query
>>> > that would help me figure out what is blocking it (peering blocked
>>> > by is empty). The PGs are scattered across all the hosts so we can't pin 
>>> > it down to a specific host.
>>> >
>>> > Any ideas on what to try would be appreciated.
>>> >
>>> > [ulhglive-root@ceph9 ~]# ceph --version ceph version 0.80.7
>>> > (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>> > [ulhglive-root@ceph9 ~]# ceph status
>>> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>>> >  health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs
>>> > stuck inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20)
>>> >  monmap e2: 3 mons at
>>> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.2
>>> > 9:6789/0}, election epoch 30, quorum 0,1,2 mon1,mon2,mon3
>>> >  osdmap e704: 120 osds: 120 up, 120 in
>>> >   pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects
>>> > 11447 MB used, 436 TB / 436 TB avail
>>> >  727 active+clean
>>> >  990 peering
>>> >   37 creating+peering
>>> >1 down+peering
>>> >  290 remapped+peering
>>> >3 creating+remapped+peering
>>> >
>>> > { "state": "peering",
>>> >   "epoch": 707,
>>> >   "up": [
>>> > 40,
>>> > 92,
>>> > 48,
>>> > 91],
>>> >   "acting": [
>&g

[ceph-users] What are you doing to locate performance issues in a Ceph cluster?

2015-04-06 Thread Robert LeBlanc
I see that ceph has 'ceph osd perf' that gets the latency of the OSDs.
Is there a similar command that would provide some performance data
about RBDs in use? I'm concerned about our ability to determine which
RBD(s) may be "abusing" our storage at any given time.
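
For reference, the OSD-side tools I know of look like this (output format
varies by release):

ceph osd perf
# per-daemon counters via the admin socket
ceph daemon osd.0 perf dump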

What are others doing to locate performance issues in their Ceph clusters?

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to dispatch monitors in a multi-site cluster (ie in 2 datacenters)

2015-04-13 Thread Robert LeBlanc
I really like this proposal.

On Mon, Apr 13, 2015 at 2:33 AM, Joao Eduardo Luis  wrote:
> On 04/13/2015 02:25 AM, Christian Balzer wrote:
>> On Sun, 12 Apr 2015 14:37:56 -0700 Gregory Farnum wrote:
>>
>>> On Sun, Apr 12, 2015 at 1:58 PM, Francois Lafont 
>>> wrote:
 Somnath Roy wrote:

> Interesting scenario :-).. IMHO, I don't think cluster will be in
> healthy state here if the connections between dc1 and dc2 is cut. The
> reason is the following.
>
> 1. only osd.5 can talk to both data center  OSDs and other 2 mons
> will not be. So, they can't reach to an agreement (and form quorum)
> about the state of OSDs.
>
> 2. OSDs on dc1 and dc2 will not be able to talk to each other,
> considering replicas across data centers, the cluster will be broken.

 Yes, in fact, after thought, I have the first question below.

 If: (more clear with a schema is the head ;))

 1. mon.1 and mon.2 can talk together (in dc1) and can talk with
 mon.5 (via the VPN) but can't talk with mon.3 and mon.4 (in dc2)
 2. mon.3 and mon.4 can talk together (in dc2) and can talk with
 mon.5 (via the VPN) but can't talk with mon.1 and mon.2 (in dc1)
 3. mon.5 can talk with mon.1, mon.2, mon.3, mon.4 and mon.5

 is the quorum reached? If yes, which is the quorum?
>>>
>>> Yes, you should get a quorum as mon.5 will vote for one datacenter or
>>> the other. Which one it chooses will depend on which monitor has the
>>> "lowest" IP address (I think, or maybe just the monitor IDs or
>>> something? Anyway, it's a consistent ordering).
>>
>> Pet peeve alert. ^_-
>>
>> It's the lowest IP.
>
> To be more precise, it's the lowest IP:PORT combination:
>
> 10.0.1.2:6789 = rank 0
> 10.0.1.2:6790 = rank 1
> 10.0.1.3:6789 = rank 3
>
> and so on.
>
>> Which is something that really needs to be documented (better) so that
>> people can plan things accordingly and have the leader monitor wind up on
>> the best suited hardware (in case not everything is being equal).
>>
>> Other than that, the sequence of how (initial?) mons are listed in
>> ceph.conf would of course be the most natural, expected way to sort
>> monitors.
>
> I don't agree.  I find it hard to rely on ceph.conf for sensitive
> decisions like this, because we must ensure that ceph.conf is the same
> in all the nodes;  and I've seen this not being the case more often than
> not.
>
> On the other hand, I do agree that we should make it easier for people
> to specify which monitors they want in the line of succession to the
> leader, so that they can plan their clusters accordingly.  I do believe
> we can set this on the monmap, ideally once the first quorum is formed;
> something like:
>
> ceph mon rank set mon.a 0
> ceph mon rank set mon.b 2
> ceph mon rank set mon.c 1
>
> ceph mon rank list
>
>   MON   IP:PORT   RANK POLICYSTATUS
>   mon.a 10.0.1.2:6789 rank 0  [set-by-user]  leader
>   mon.c 10.0.1.3:6789 rank 1  [set-by-user]  peon
>   mon.b 10.0.1.2:6790 rank 2  [set-by-user]  down
>   mon.d 10.0.1.4:6789 rank 3  [default]  peon
>
>
> Thoughts?
>
>   -Joao
>
>>
>> Christian
>>
>>
>>> Under no circumstances
>>> whatsoever will mon.5 help each datacenter create their own quorums at
>>> the same time. The other data center will just be out of luck and
>>> unable to do anything.
>>> Although it's possible that the formed quorum won't be very stable
>>> since the out-of-quorum monitors will probably keep trying to form a
>>> quorum and that might make mon.5 unhappy. You should test what happens
>>> with that kind of net split. :)
>>> -Greg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low power single disk nodes

2015-04-13 Thread Robert LeBlanc
We are getting ready to put the Quantas into production. We looked at
the Supermicro Atoms (we have 6 of them); the rails were crap (they
exploded the first time you pulled the server out, and they stick out of
the back of the cabinet about 8 inches, and these boxes are already very
deep), and we also ran out of CPU on these boxes and had limited PCI I/O.
They may work fine for really cold data. They may also work fine with
XIO and Infiniband. The Atoms still had pretty decent performance
given these limitations.

The Quantas removed some of the issues with NUMA, have much better PCI
I/O bandwidth, and come with a 10 Gb NIC on board. The biggest drawback
is that 8 drives are on a SAS controller and 4 drives are on a SATA
controller, plus a SATADOM and a free port. So you have to manage two
different controller types and speeds (6Gb SAS and 3Gb SATA).

I'd say neither is perfect, but we decided on Quanta in the end.

On Mon, Apr 13, 2015 at 5:17 AM, Jerker Nyberg  wrote:
>
> Hello,
>
> Thanks for all replies! The Banana Pi could work. The built in SATA-power in
> Banana Pi can power a 2.5" SATA disk. Cool. (Not 3.5" SATA since that seem
> to require 12 V too.)
>
> I found this post from Vess Bakalov about the same subject:
> http://millibit.blogspot.se/2015/01/ceph-pi-adding-osd-and-more-performance.html
>
> For PoE I have only found Intel Galileo Gen 2 or RouterBOARD RB450G which
> are too slow and/or miss IO-expansion. (But good for signage/Xibo maybe!)
>
> I found two boxes from Quanta and SuperMicro with single socket Xeon or with
> Intel Atom (Avaton) that might be quite ok. I was only aware of the
> dual-Xeons before.
>
> http://www.quantaqct.com/Product/Servers/Rackmount-Servers/STRATOS-S100-L11SL-p151c77c70c83
> http://www.supermicro.nl/products/system/1U/5018/SSG-5018A-AR12L.cfm
>
> Kind regards,
> Jerker Nyberg
>
>
>
>
> On Thu, 9 Apr 2015, Quentin Hartman wrote:
>
>> I'm skeptical about how well this would work, but a Banana Pi might be a
>> place to start. Like a raspberry pi, but it has a SATA connector:
>> http://www.bananapi.org/
>>
>> On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg  wrote:
>>
>>>
>>> Hello ceph users,
>>>
>>> Is anyone running any low powered single disk nodes with Ceph now?
>>> Calxeda
>>> seems to be no more according to Wikipedia. I do not think HP moonshot is
>>> what I am looking for - I want stand-alone nodes, not server cartridges
>>> integrated into server chassis. And I do not want to be locked to a
>>> single
>>> vendor.
>>>
>>> I was playing with Raspberry Pi 2 for signage when I thought of my old
>>> experiments with Ceph.
>>>
>>> I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or maybe
>>> something with a low-power Intel x64/x86 processor. Together with one SSD
>>> or one low power HDD the node could get all power via PoE (via splitter
>>> or
>>> integrated into board if such boards exist). PoE provide remote power-on
>>> power-off even for consumer grade nodes.
>>>
>>> The cost for a single low power node should be able to compete with
>>> traditional PC-servers price per disk. Ceph take care of redundancy.
>>>
>>> I think simple custom casing should be good enough - maybe just strap or
>>> velcro everything on trays in the rack, at least for the nodes with SSD.
>>>
>>> Kind regards,
>>> --
>>> Jerker Nyberg, Uppsala, Sweden.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network redundancy pro and cons, best practice, suggestions?

2015-04-13 Thread Robert LeBlanc
For us, using two 40Gb ports with VLANs is redundancy enough. We are
doing LACP over two different switches.
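
For what it's worth, the bond itself is nothing special; a minimal sketch
(interface names and addressing are just examples, and the switch pair has
to support MLAG for LACP across two switches):

auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-slaves ens1f0 ens1f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4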

On Mon, Apr 13, 2015 at 3:03 AM, Götz Reinicke - IT Koordinator
 wrote:
> Dear ceph users,
>
> we are planing a ceph storage cluster from scratch. Might be up to 1 PB
> within the next 3 years, multiple buildings, new network infrastructure
> for the cluster etc.
>
> I had some excellent trainings on ceph, so the essential fundamentals
> are familiar to me, and I know our goals/dreams can be reached. :)
>
> There is just "one tiny piece" in the design I'm currently unsure about :)
>
> Ceph follows some sort of keep it small and simple, e.g. dont use raid
> controllers, use more boxes and disks, fast network etc.
>
> So from our current design we plan 40Gb Storage and Client LAN.
>
> Would you suggest to connect the OSD nodes redundant to both networks?
> That would end up with 4 * 40Gb ports in each box, two Switches to
> connect to.
>
> I'd think of OSD nodes with 12 - 16 * 4TB SATA disks for "high" io
> pools. (+ currently SSD for journal, but may be until we start, levelDB,
> rocksDB are ready ... ?)
>
> Later some less io bound pools for data archiving/backup. (bigger and
> more Disks per node)
>
> We would also do some Cache tiering for some pools.
>
> From HP, Intel, Supermicron etc reference documentations, they use
> usually non-redundant network connection. (single 10Gb)
>
> I know: redundancy keeps some headaches small, but also adds some more
> complexity and increases the budget. (add network adapters, other
> server, more switches, etc)
>
> So what would you suggest, what are your experiences?
>
> Thanks for any suggestion and feedback . Regards . Götz
> --
> Götz Reinicke
> IT-Koordinator
>
> Tel. +49 7141 969 82 420
> E-Mail goetz.reini...@filmakademie.de
>
> Filmakademie Baden-Württemberg GmbH
> Akademiehof 10
> 71638 Ludwigsburg
> www.filmakademie.de
>
> Eintragung Amtsgericht Stuttgart HRB 205016
>
> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
> Staatssekretär im Ministerium für Wissenschaft,
> Forschung und Kunst Baden-Württemberg
>
> Geschäftsführer: Prof. Thomas Schadt
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low power single disk nodes

2015-04-13 Thread Robert LeBlanc
We got one of those too. I think the cabling on the front and the
limited I/O options deterred us; otherwise, I really liked that box.

On Mon, Apr 13, 2015 at 10:34 AM, Nick Fisk  wrote:
> I went for something similar to the Quantas boxes but 4 stacked in 1x 4U box
>
> http://www.supermicro.nl/products/system/4U/F617/SYS-F617H6-FTPT_.cfm
>
> When you do the maths, even something like a Banana Pi + disk starts costing
> a similar amount, and you get so much more for your money in terms of
> processing power, NIC bandwidth... etc.
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Robert LeBlanc
>> Sent: 13 April 2015 17:27
>> To: Jerker Nyberg
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] low power single disk nodes
>>
>> We are getting ready to put the Quantas into production. We looked at the
>> Supermicro Atoms (we have 6 of them): the rails were crap (they exploded
>> the first time you pulled the server out, they stick out of the back of
>> the cabinet about 8 inches, and these boxes are already very deep), we also
>> ran out of CPU on these boxes and had limited PCI I/O. They may work fine
>> for really cold data. They may also work fine with XIO and Infiniband. The
>> Atoms still had pretty decent performance given these limitations.
>>
>> The Quantas removed some of the issues with NUMA, had much better PCI
>> I/O bandwidth, and come with a 10Gb NIC on board. The biggest drawback is
>> that 8 drives are on a SAS controller and 4 drives are on a SATA controller,
>> plus a SATADOM and a free port. So you have to manage two different
>> controller types and speeds (6Gb SAS and 3Gb SATA).
>>
>> I'd say neither is perfect, but we decided on Quanta in the end.
>>
>> On Mon, Apr 13, 2015 at 5:17 AM, Jerker Nyberg 
>> wrote:
>> >
>> > Hello,
>> >
>> > Thanks for all replies! The Banana Pi could work. The built-in
>> > SATA power on the Banana Pi can power a 2.5" SATA disk. Cool. (Not 3.5"
>> > SATA, since those seem to require 12 V too.)
>> >
>> > I found this post from Vess Bakalov about the same subject:
>> > http://millibit.blogspot.se/2015/01/ceph-pi-adding-osd-and-more-performance.html
>> >
>> > For PoE I have only found the Intel Galileo Gen 2 or RouterBOARD RB450G,
>> > which are too slow and/or lack I/O expansion. (But good for
>> > signage/Xibo maybe!)
>> >
>> > I found two boxes from Quanta and Supermicro with a single-socket Xeon
>> > or with an Intel Atom (Avoton) that might be quite OK. I was only aware
>> > of the dual Xeons before.
>> >
>> > http://www.quantaqct.com/Product/Servers/Rackmount-Servers/STRATOS-S100-L11SL-p151c77c70c83
>> > http://www.supermicro.nl/products/system/1U/5018/SSG-5018A-AR12L.cfm
>> >
>> > Kind regards,
>> > Jerker Nyberg
>> >
>> >
>> >
>> >
>> > On Thu, 9 Apr 2015, Quentin Hartman wrote:
>> >
>> >> I'm skeptical about how well this would work, but a Banana Pi might
>> >> be a place to start. Like a raspberry pi, but it has a SATA connector:
>> >> http://www.bananapi.org/
>> >>
>> >> On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg 
>> wrote:
>> >>
>> >>>
>> >>> Hello ceph users,
>> >>>
>> >>> Is anyone running any low powered single disk nodes with Ceph now?
>> >>> Calxeda
>> >>> seems to be no more according to Wikipedia. I do not think HP
>> >>> moonshot is what I am looking for - I want stand-alone nodes, not
>> >>> server cartridges integrated into server chassis. And I do not want
>> >>> to be locked to a single vendor.
>> >>>
>> >>> I was playing with Raspberry Pi 2 for signage when I thought of my
>> >>> old experiments with Ceph.
>> >>>
>> >>> I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or maybe
>> >>> something with a low-power Intel x64/x86 processor. Together with
>> >>> one SSD or one low power HDD the node could get all power via PoE
>> >>> (via splitter or integrated into board if such boards exist). PoE
>> >>> provide remote power-on power-off even for consumer grade nodes.
>> >>>
>> >>> The cost for a single low power node should be able to compete with
>> >>> traditional PC-servers price per disk.

[ceph-users] norecover and nobackfill

2015-04-13 Thread Robert LeBlanc
I'm looking for documentation about what exactly each of these does and
I can't find it. Can someone point me in the right direction?

The names seem too ambiguous to come to any conclusion about what
exactly they do.

Thanks,
Robert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] norecover and nobackfill

2015-04-13 Thread Robert LeBlanc
After doing some testing, I'm even more confused.

What I'm trying to achieve is minimal data movement when I have to service
a node to replace a failed drive. Since these nodes don't have hot-swap
bays, I'll need to power down the box to replace the failed drive. I don't
want Ceph to shuffle data until the new drive comes up and is ready.

My thought was to set norecover and nobackfill, take down the host, replace the
drive, start the host, remove the old OSD from the cluster, ceph-disk
prepare the new disk, then unset norecover and nobackfill.
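
In other words, roughly this (just a sketch; osd.X and /dev/sdX are
placeholders):

  ceph osd set norecover
  ceph osd set nobackfill

  # power the host down, swap the failed drive, bring it back up ...

  # retire the dead OSD and prepare the replacement disk
  ceph auth del osd.X
  ceph osd crush remove osd.X
  ceph osd rm X
  ceph-disk prepare /dev/sdX      # plus the journal device, if any

  ceph osd unset nobackfill
  ceph osd unset norecover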

However, in my testing with a 4-node cluster (v0.94.0, 10 OSDs each,
replication 3, min_size 2, chooseleaf_firstn host), if I take down a host,
I/O becomes blocked even though only one copy is taken down and min_size
is still satisfied. When I unset norecover, I/O proceeds and some backfill
activity happens. At some point the backfill stops and everything seems to
be "happy" in the degraded state.

I'm really interested to know what is going on with "norecover" as the
cluster seems to break if it is set. Unsetting the "norecover" flag causes
some degraded objects to recover, but not all. Writing to new blocks in an
RBD causes the number of degraded objects to increase, but works just fine
otherwise. Here is an example after taking down one host and removing the
OSDs from the CRUSH map (I'm reformatting all the drives in the host
currently).

# ceph status
cluster 146c4fe8-7c85-46dc-b8b3-69072d658287
 health HEALTH_WARN
1345 pgs backfill
10 pgs backfilling
2016 pgs degraded
661 pgs recovery_wait
2016 pgs stuck degraded
2016 pgs stuck unclean
1356 pgs stuck undersized
1356 pgs undersized
recovery 40642/167785 objects degraded (24.223%)
recovery 31481/167785 objects misplaced (18.763%)
too many PGs per OSD (665 > max 300)
nobackfill flag(s) set
 monmap e5: 3 mons at {nodea=10.8.6.227:6789/0,nodeb=10.8.6.228:6789/0,nodec=10.8.6.229:6789/0}
election epoch 2576, quorum 0,1,2 nodea,nodeb,nodec
 osdmap e59031: 30 osds: 30 up, 30 in; 1356 remapped pgs
flags nobackfill
  pgmap v4723208: 6656 pgs, 4 pools, 330 GB data, 53235 objects
863 GB used, 55000 GB / 55863 GB avail
40642/167785 objects degraded (24.223%)
31481/167785 objects misplaced (18.763%)
4640 active+clean
1345 active+undersized+degraded+remapped+wait_backfill
 660 active+recovery_wait+degraded
  10 active+undersized+degraded+remapped+backfilling
   1 active+recovery_wait+undersized+degraded+remapped
  client io 1864 kB/s rd, 8853 kB/s wr, 65 op/s

Any help understanding these flags would be very helpful.

Thanks,
Robert

On Mon, Apr 13, 2015 at 1:40 PM, Robert LeBlanc 
wrote:

> I'm looking for documentation about what exactly each of these does and
> I can't find it. Can someone point me in the right direction?
>
> The names seem too ambiguous to come to any conclusion about what
> exactly they do.
>
> Thanks,
> Robert
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] norecover and nobackfill

2015-04-14 Thread Robert LeBlanc
Hmmm... I've been deleting the OSD (ceph osd rm X; ceph osd crush rm osd.X)
along with removing the auth key. This has caused data movement, but
reading your reply and thinking about it made me think it should be done
differently. I should just remove the auth key and leave the OSD in the
CRUSH map. That should work, I'll test it on my cluster.
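
That is, something like this (just sketching the difference; osd.X is a
placeholder):

  # what I had been doing -- also drops the CRUSH entry, which changes the
  # host weight and moves data:
  ceph osd rm X
  ceph osd crush rm osd.X
  ceph auth del osd.X

  # what I'll try instead -- only remove the auth key and leave the OSD
  # (and its weight) in the CRUSH map:
  ceph auth del osd.X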

I'd still like to know the difference between norecover and nobackfill if
anyone knows.

On Mon, Apr 13, 2015 at 7:40 PM, Francois Lafont  wrote:

> Hi,
>
> Robert LeBlanc wrote:
>
> > What I'm trying to achieve is minimal data movement when I have to
> > service a node to replace a failed drive. [...]
>
> Perhaps I'm saying something stupid, but it seems to me that this is
> the goal of the "noout" flag, isn't it?
>
> 1. ceph osd set noout
> 2. An old OSD disk fails; there is no rebalancing of data because noout
> is set, the cluster is just degraded.
> 3. You remove from the cluster the OSD daemon which used the old disk.
> 4. You power off the host, replace the old disk with a new one, and
> restart the host.
> 5. You create a new OSD on the new disk.
>
> With these steps, there will be no movement of data, except during step 5,
> when the data will be recreated on the new disk (but that's normal and
> desired).
>
> Sorry in advance if there is something I'm missing in your problem.
> Regards.
>
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] norecover and nobackfill

2015-04-14 Thread Robert LeBlanc
OK, I remember now: if I don't remove the OSD from the CRUSH map, ceph-disk
will get a new OSD ID and the old one will hang around as a zombie. This
will change the host/rack/etc. weights, causing a cluster-wide rebalance.

On Tue, Apr 14, 2015 at 9:31 AM, Robert LeBlanc 
wrote:

> Hmmm... I've been deleting the OSD (ceph osd rm X; ceph osd crush rm
> osd.X) along with removing the auth key. This has caused data movement, but
> reading your reply and thinking about it made me think it should be done
> differently. I should just remove the auth key and leave the OSD in the
> CRUSH map. That should work, I'll test it on my cluster.
>
> I'd still like to know the difference between norecover and nobackfill if
> anyone knows.
>
> On Mon, Apr 13, 2015 at 7:40 PM, Francois Lafont 
> wrote:
>
>> Hi,
>>
>> Robert LeBlanc wrote:
>>
>> > What I'm trying to achieve is minimal data movement when I have to
>> > service a node to replace a failed drive. [...]
>>
>> Perhaps I'm saying something stupid, but it seems to me that this is
>> the goal of the "noout" flag, isn't it?
>>
>> 1. ceph osd set noout
>> 2. An old OSD disk fails; there is no rebalancing of data because noout
>> is set, the cluster is just degraded.
>> 3. You remove from the cluster the OSD daemon which used the old disk.
>> 4. You power off the host, replace the old disk with a new one, and
>> restart the host.
>> 5. You create a new OSD on the new disk.
>>
>> With these steps, there will be no movement of data, except during step 5,
>> when the data will be recreated on the new disk (but that's normal and
>> desired).
>>
>> Sorry in advance if there is something I'm missing in your problem.
>> Regards.
>>
>>
>> --
>> François Lafont
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph repo - RSYNC?

2015-04-15 Thread Robert LeBlanc
http://eu.ceph.com/ has rsync and Hammer.
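
Something along these lines should work for pulling a local mirror (the
module name below is an assumption on my part -- list what the server
actually exports first):

  # list the rsync modules the mirror exports
  rsync eu.ceph.com::

  # then mirror the tree you need into a local directory, e.g.:
  rsync -avz --delete eu.ceph.com::ceph/ /srv/mirror/ceph/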

On Wed, Apr 15, 2015 at 10:17 AM, Paul Mansfield <
paul.mansfi...@alcatel-lucent.com> wrote:

>
> Sorry for starting a new thread, I've only just subscribed to the list
> and the archive on the mail listserv is far from complete at the moment.
>
> on 8th March David Moreau Simard said
>   http://www.spinics.net/lists/ceph-users/msg16334.html
> that there was a rsync'able mirror of the ceph repo at
> http://ceph.mirror.iweb.ca/
>
>
> My problem is that the repo doesn't include Hammer. Is there someone who
> can get that added to the mirror?
>
> thanks very much
> Paul
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-17 Thread Robert LeBlanc
Delete and re-add all six OSDs.
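
Roughly like this for each of the six (a sketch only -- IDs and device names
are placeholders; since the cluster has already rebalanced, the data on those
disks can simply be thrown away and backfilled from the surviving replicas):

  # retire the old OSD
  ceph osd out X
  ceph osd crush remove osd.X
  ceph auth del osd.X
  ceph osd rm X

  # wipe the data disk and recreate the OSD with its journal on the new SSD
  ceph-disk zap /dev/sdX
  ceph-disk prepare /dev/sdX /dev/sdY   # sdY = the new journal SSD; ceph-disk
                                        # carves a journal partition on it
  ceph-disk activate /dev/sdX1          # usually udev triggers this itself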

On Fri, Apr 17, 2015 at 3:36 AM, Andrija Panic 
wrote:

> Hi guys,
>
> I have 1 SSD that hosted the journals for 6 OSDs and that is now dead, so 6
> OSDs are down and Ceph has rebalanced, etc.
>
> Now I have a new SSD inside, and I will partition it etc. - but I would like
> to know how to proceed with the journal recreation for those 6 OSDs that are
> down now.
>
> Should I flush the journals (to where? they don't exist any more...), or
> just recreate the journals from scratch (making symbolic links again: ln -s
> /dev/$DISK$PART /var/lib/ceph/osd/ceph-$ID/journal) and start the OSDs?
>
> I expect the following procedure, but would like confirmation please:
>
> rm /var/lib/ceph/osd/ceph-$ID/journal -f (sym link)
> ln -s /dev/SDAxxx /var/lib/ceph/osd/ceph-$ID/journal
> ceph-osd -i $ID --mkjournal
> ll /var/lib/ceph/osd/ceph-$ID/journal
> service ceph start osd.$ID
>
> Any thoughts greatly appreciated!
>
> Thanks,
>
> --
>
> Andrija Panić
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

