[ceph-users] Nagios Check for Ceph-Dash

2014-06-02 Thread Christian Eichelmann
Hi folks! For those of you who are using ceph-dash (https://github.com/Crapworks/ceph-dash), I've created a Nagios plugin that uses the JSON endpoint to monitor your cluster remotely: * https://github.com/Crapworks/check_ceph_dash I think this can easily be adapted to use the ceph-rest-api as…
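
A minimal sketch of such a remote check, assuming the dashboard serves the cluster status as JSON at its root URL and that jq is installed; the URL and field name below are assumptions, not taken from the plugin:

  #!/bin/sh
  # Hypothetical endpoint and JSON field; Nagios exit codes: 0=OK, 1=WARNING, 2=CRITICAL.
  STATUS=$(curl -s http://ceph-dash.example.com/ | jq -r '.health.overall_status')
  case "$STATUS" in
    HEALTH_OK)   echo "CEPH OK - $STATUS";       exit 0 ;;
    HEALTH_WARN) echo "CEPH WARNING - $STATUS";  exit 1 ;;
    *)           echo "CEPH CRITICAL - $STATUS"; exit 2 ;;
  esac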

[ceph-users] PG Scrub Error / active+clean+inconsistent

2014-06-10 Thread Christian Eichelmann
Hi all, after coming back from a long weekend, I found my production cluster in an error state, mentioning 6 scrub errors and 6 pgs in active+clean+inconsistent state. Strangely, my pre-live cluster, running on different hardware, is also showing 1 scrub error and 1 inconsistent pg... pg d…

Re: [ceph-users] PG Scrub Error / active+clean+inconsistent

2014-06-10 Thread Christian Eichelmann
Hi again, just found the ceph pg repair command :) Now both clusters are OK again. Anyway, I'm really interested in the cause of the problem. Regards, Christian On 10.06.2014 10:28, Christian Eichelmann wrote: > Hi all, > > after coming back from a long weekend, I found my prod…
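
The workflow mentioned above, as a hedged sketch (the pg id is an example; ceph health detail names the PGs actually affected):

  $ ceph health detail | grep inconsistent   # e.g. "pg 2.37 is active+clean+inconsistent, acting [12,5,61]"
  $ ceph pg repair 2.37                      # ask the primary OSD to repair that PG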

[ceph-users] Behaviour of ceph pg repair on different replication levels

2014-06-23 Thread Christian Eichelmann
Hi ceph users, since our cluster has had a few inconsistent pgs recently, I was wondering what ceph pg repair does, depending on the replication level. So I just wanted to check if my assumptions are correct: Replication 2x Since the cluster can not decide which version is the correct one, it wou…
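
The replication level in question is the pool's size; a quick way to check it (the pool name is an example):

  $ ceph osd pool get rbd size       # number of replicas kept
  $ ceph osd pool get rbd min_size   # replicas required before the pool serves I/O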

Re: [ceph-users] external monitoring tools for ceph

2014-07-01 Thread Christian Eichelmann
>>> Is there any other tool which can also be used to monitor ceph, especially for object storage? >>> Regards, Pragya Jain …

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Christian Eichelmann
I can also confirm that after upgrading to firefly both of our clusters (test and live) went from 0 scrub errors each for about 6 months to about 9-12 per week... This also makes me kind of nervous, since as far as I know everything "ceph pg repair" does is copy the primary object to al…

[ceph-users] OSDs are crashing with "Cannot fork" or "cannot create thread" but plenty of memory is left

2014-09-12 Thread Christian Eichelmann
Hi Ceph-Users, I have absolutely no idea what is going on on my systems... Hardware: 45 x 4TB hard disks, 2 x 6-core CPUs, 256 GB memory. When initializing all disks and joining them to the cluster, after approximately 30 OSDs, other OSDs start crashing. When I try to start them again I see different kind…

Re: [ceph-users] OSDs are crashing with "Cannot fork" or "cannot create thread" but plenty of memory is left

2014-09-12 Thread Christian Eichelmann
Hi, I am running all commands as root, so there are no limits for the processes. Regards, Christian --- From: Mariusz Gronczewski [mariusz.gronczew...@efigence.com] Sent: Friday, 12 September 2014 15:33 To: Christian Eichelmann Cc: ceph-users…

Re: [ceph-users] OSDs are crashing with "Cannot fork" or "cannot create thread" but plenty of memory is left

2014-09-15 Thread Christian Eichelmann
…you're hitting, > but that is the most likely one > > Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy. > I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals) > with that kind of case and enjoy the fact that my OSDs never fail. ^o^ …

Re: [ceph-users] OSDs are crashing with "Cannot fork" or "cannot create thread" but plenty of memory is left

2014-09-23 Thread Christian Eichelmann
…there's an issue here http://tracker.ceph.com/issues/6142 , although it > doesn't seem to have gotten much traction in terms of informing users. > > Regards > Nathan > > On 15/09/2014 7:13 PM, Christian Eichelmann wrote: >> Hi all, >> >> I have no idea why runni…
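
A common cause of "cannot fork" / "cannot create thread" on dense OSD boxes is the default kernel.pid_max of 32768 being exhausted by OSD threads (each OSD can spawn hundreds). A hedged sketch of the usual remedy; the value is an example, tune it for your nodes:

  $ sysctl kernel.pid_max                                # check the current limit
  $ echo 'kernel.pid_max = 4194303' >> /etc/sysctl.conf  # raise it persistently
  $ sysctl -p                                            # apply now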

[ceph-users] Monitor Restart triggers half of our OSDs marked down

2015-02-03 Thread Christian Eichelmann
Hi all, during some failover and configuration tests, we discovered a strange phenomenon: restarting one of our monitors (5 in total) triggers about 300 of the following events: osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after 22.005858 >= grace 20.00)…
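
The log line shows the failure-report mechanics: the monitor marks an OSD down once enough distinct peers report it after the heartbeat grace expires. A hedged sketch of inspecting and tuning the relevant option at runtime (option name as of the Firefly/Giant era; the value is an example):

  $ ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters   # distinct reporters required
  $ ceph tell mon.* injectargs '--mon_osd_min_down_reporters 20'           # persist it in ceph.conf as well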

Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down

2015-02-04 Thread Christian Eichelmann
…s and something is happening when one of them goes down. If I can provide you any more information to clarify the issue, just tell me what you need. Regards, Christian On 03.02.2015 18:10, Gregory Farnum wrote: > On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann > wrote: >> …

Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down

2015-02-05 Thread Christian Eichelmann
…osd.1202 128.142.23.104:6801/98353 59 : > [WRN] map e132056 wrongly marked me down > 2015-01-29 11:29:35.441922 osd.1164 128.142.23.102:6850/22486 25 : > [WRN] map e132056 wrongly marked me down The behaviour is exactly the same on our system, so it looks like the same issue. We are currently runni…

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Christian Eichelmann

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-10 Thread Christian Eichelmann
…where in the docs we can put this to catch more users? Or maybe a warning issued by the OSDs themselves or something if they see limits that are low? sage - Karan - On 09 Mar 2015, at 14:48, Christian Eichelmann wrote: Hi Karan, as you are actually writing in your own book, the p…

[ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-20 Thread Christian Eichelmann
Hi Ceph-Users! We currently have a problem where I am not sure whether it has its cause in Ceph or somewhere else. First, some information about our ceph setup: * ceph version 0.87.1 * 5 MON * 12 OSD servers with 60x2TB each * 2 RSYNC gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian Wheezy)…
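
When rsync stalls at 100% I/O wait on a kernel RBD client, a few hedged diagnostics (the debugfs path requires debugfs to be mounted; paths are the usual defaults):

  $ iostat -x 1                          # confirm which rbd device sits at 100% util
  $ cat /sys/kernel/debug/ceph/*/osdc    # in-flight requests of the kernel client
  $ dmesg | tail                         # look for libceph / hung-task messages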

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-20 Thread Christian Eichelmann
…there may be a fix which might stop this from > happening. > > Nick > >> -----Original Message----- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Christian Eichelmann >> Sent: 20 April 2015 08:29 >> To: ceph-users@lists.ce…

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-20 Thread Christian Eichelmann
…ble. The >>> RBD client is kernel based and so there may be a fix which might stop >>> this from happening. >>> >>> Nick >>> >>>> -----Original Message----- >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of …

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-21 Thread Christian Eichelmann
…mes and never >> really >> got to the bottom of it, whereas the same volumes formatted with EXT4 have >> been running for years without a problem. >> >>> -----Original Message----- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of …

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-21 Thread Christian Eichelmann
…ng, was there anything printed to dmesg? > Cheers, Dan > > On Mon, Apr 20, 2015 at 9:29 AM, Christian Eichelmann > wrote: >> Hi Ceph-Users! >> >> We currently have a problem where I am not sure whether it has its cause >> in Ceph or somewhere else. First, some…

Re: [ceph-users] 100% IO Wait with CEPH RBD and RSYNC

2015-04-21 Thread Christian Eichelmann
…cluster. Is iptables getting in the way? > > Cheers, Dan > > On Tue, Apr 21, 2015 at 9:13 AM, Christian Eichelmann > wrote: >> Hi Dan, >> >> we are already back on the kernel module since the same problems were >> happening with fuse. I had no special ulimit setting…

[ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Christian Eichelmann
Hi all! We are experiencing approximately 1 scrub error / inconsistent pg every two days. As far as I know, to fix this you can issue a "ceph pg repair", which works fine for us. I have a few questions regarding the behaviour of the ceph cluster in such a case: 1. After ceph detects the scrub error…
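
To see which object a scrub actually flagged before repairing, the primary OSD's log is the place to look; a hedged sketch (OSD id and log path are examples):

  $ grep 'ERR' /var/log/ceph/ceph-osd.12.log   # scrub error lines name the PG and the object with the bad copy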

Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Christian Eichelmann
…I've only tested this on an idle cluster, so I don't know how well it >> will work on an active cluster. Since we issue a deep-scrub, if the PGs >> of the replicas change during the rsync, it should come up with an >> error. The idea is to keep rsyncing unti…

[ceph-users] Remove RBD Image

2015-07-29 Thread Christian Eichelmann
Hi all, I am trying to remove several rbd images from the cluster. Unfortunately, that doesn't work:

$ rbd info foo
rbd image 'foo':
        size 1024 GB in 262144 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.919443.238e1f29
        format: 1
$ rbd rm foo
2015-07-2…
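
A common reason rbd rm blocks is a stale watcher on the image header object, which for a format 1 image like this one is named '<image>.rbd'. A hedged sketch (pool name 'rbd' is an assumption):

  $ rados -p rbd listwatchers foo.rbd      # shows any client still watching the header
  $ ceph osd blacklist add <client-addr>   # address from the output above; evicts a dead watcher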

Re: [ceph-users] Remove RBD Image

2015-07-29 Thread Christian Eichelmann
…at 11:30 AM, Christian Eichelmann > wrote: >> Hi all, >> >> I am trying to remove several rbd images from the cluster. >> Unfortunately, that doesn't work: >> >> $ rbd info foo >> rbd image 'foo': >> …

[ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-29 Thread Christian Eichelmann
Hi all, we have a ceph cluster with currently 360 OSDs in 11 systems. Last week we were replacing one OSD system with a new one. During that, we had a lot of problems with OSDs crashing on all of our systems. But that is not our current problem. After we got everything up and running again, we s…

Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
…nd data to ceph again. To tell the truth, I guess that will result in the end of our ceph project (running for 9 months already). Regards, Christian On 29.12.2014 15:59, Nico Schottelius wrote: > Hey Christian, > > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]: >> …

Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
…think > (I'm still learning ceph) that this will make different pgs for each pool, > also different OSDs; maybe this way you can overcome the issue. > > Cheers > Eneko > > On 30/12/14 12:17, Christian Eichelmann wrote: >> Hi Nico and all others who answered, …

Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
…able in ceph logs in the new pool's image > format? > > On 30/12/14 12:31, Christian Eichelmann wrote: >> Hi Eneko, >> >> I was trying a rbd cp before, but that was hanging as well. But I >> couldn't find out if the source image was causing the hang or the …

Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Christian Eichelmann
…how long Ceph will take to fully recover from > a disk or host failure by testing it with load. Your setup might not be > robust if it doesn't have the available disk space or the speed needed to > recover quickly from such a failure. > > Lionel …

[ceph-users] Documentation of ceph pg query

2015-01-09 Thread Christian Eichelmann
Hi all, as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what has happened, and I am taking a deeper look into the output of ceph pg query. The problem is that I can find some information about what some of the sections mean, but mostly I can on…
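
For reference, a hedged sketch of pulling up the output in question (the pg id is an example), with the sections that matter most when debugging peering noted:

  $ ceph pg 2.37 query | less
  # "info"           - the primary's own view of the PG (stats, last epochs)
  # "peer_info"      - what each replica reports about its copy
  # "recovery_state" - peering history, including what is blocking progress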

[ceph-users] Placementgroups stuck peering

2015-01-14 Thread Christian Eichelmann
Hi all, after our cluster problems with incomplete placement groups, we've decided to remove our pools and create new ones. This was going fine in the beginning. After adding an additional OSD server, we now have 2 PGs that are stuck in the peering state: HEALTH_WARN 2 pgs peering; 2 pgs stuck ina…
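
A hedged sketch for listing exactly which PGs are affected:

  $ ceph pg dump_stuck inactive   # PGs that have not gone active, including those stuck peering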

[ceph-users] Behaviour of Ceph while OSDs are down

2015-01-20 Thread Christian Eichelmann
Hi all, I want to understand what Ceph does if several OSDs are down. First of all, some words about our setup: We have 5 monitors and 12 OSD servers, each with 60x2TB disks. These servers are spread across 4 racks in our datacenter. Every rack holds 3 OSD servers. We have a replication factor of…
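
For a layout like this to survive a whole rack going down, the CRUSH rule has to spread replicas across racks. A hedged sketch of verifying and creating such a rule (rule and root names are examples):

  $ ceph osd tree | head                                         # racks should appear as buckets above the hosts
  $ ceph osd crush rule create-simple rack-spread default rack   # place each replica in a different rack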

Re: [ceph-users] Behaviour of Ceph while OSDs are down

2015-01-21 Thread Christian Eichelmann
…Sam On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum wrote: On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann wrote: Hi all, I want to understand what Ceph does if several OSDs are down. First of all, some words about our setup: We have 5 monitors and 12 OSD servers, each with 60x2TB disks.…

[ceph-users] Ceph Plugin for Collectd

2014-05-14 Thread Christian Eichelmann
Hi Ceph users! I had a look at the "official" collectd fork for ceph, which is quite outdated and not compatible with the upstream version. Since this was not an option for us, I've written a Python plugin for collectd that gets all the precious information out of the admin socket's "perf dump" com…
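
The data source the plugin reads can be inspected by hand through each daemon's admin socket; a hedged sketch (the socket path is the common default, an assumption here):

  $ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | head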

Re: [ceph-users] visualizing a ceph cluster automatically

2014-05-16 Thread Christian Eichelmann
I have written a small and lightweight GUI, which can also act as a JSON REST API (for non-interactive monitoring): https://github.com/Crapworks/ceph-dash Maybe that's what you're searching for. Regards, Christian From: ceph-users [ceph-users-boun...@lists.ceph.com]…