Success! Hopefully my notes from the process will help:
In the event of multiple disk failures the cluster could lose PGs. Should this
occur, it is best to attempt to restart the OSD process and have the drive
marked as up+out. Marking the drive as out will cause data to flow off the
drive to elsewhere in the cluster.
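A minimal sketch of that step, assuming Ubuntu 14.04 with the stock upstart jobs
and using osd.12 as a placeholder id:

    start ceph-osd id=12     # or "restart ceph-osd id=12" if the daemon is already running
    ceph osd out 12          # up+out: the OSD stays up so its copies stay readable while data drains elsewhere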
On Apr 7, 2015, at 7:44 PM, Francois Lafont wrote:
> Chris Kitzmiller wrote:
>> I graph aggregate stats for `ceph --admin-daemon
>> /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too far
>> outside of my mean latency I know to go look for the
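For anyone wanting to graph the same thing: the admin socket returns plain JSON,
so something along these lines pulls out a single latency counter (the osd id
and the exact counter name are examples; counter names vary a bit by release):

    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump | \
        python -m json.tool | grep -A 3 '"op_w_latency"'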
I'm not having much luck here. Is there a possibility that the imported PGs
aren't being picked up because the MONs think that they're older than the empty
PGs I find on the up OSDs?
I feel that I'm so close to *not* losing my RBD volume because I only have two
bad PGs and I've successfully exported them.
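One way to check that theory is to query one of the PGs and read its
recovery_state section; the pgid below is just an example:

    ceph pg 2.1f query | less

Fields like past_intervals, down_osds_we_would_probe and peering_blocked_by
should show whether the empty copies on the up OSDs are being treated as newer
than the imported ones.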
On Apr 6, 2015, at 7:04 PM, Robert LeBlanc wrote:
> I see that ceph has 'ceph osd perf' that gets the latency of the OSDs.
> Is there a similar command that would provide some performance data
> about RBDs in use? I'm concerned about our ability to determine which
> RBD(s) may be "abusing" our storage.
min_size = 1 and size = 2.
> On Thu, Apr 2, 2015 at 10:20 PM, Chris Kitzmiller wrote:
>> On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles wrote:
>>> according to your ceph osd tree capture, although the OSD reweight is set
>>> to 1, the OSD CRUSH weight is set to 0 (2n
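For reference, the pool settings above can be confirmed with (the pool name rbd
is a placeholder):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

With size = 2, two overlapping disk failures can leave a PG with no surviving
complete copy, which is exactly the failure mode in this thread.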
On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles wrote:
> according to your ceph osd tree capture, although the OSD reweight is set to
> 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD
> a CRUSH weight so that it can be selected by CRUSH: ceph osd crush reweight
> os
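For the archives, that command takes the OSD name and a CRUSH weight (commonly
the disk size in TiB); the id and weight here are placeholders:

    ceph osd crush reweight osd.12 1.82
    ceph osd tree        # the 2nd column (CRUSH weight) should now be non-zero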
I have a cluster running 0.80.9 on Ubuntu 14.04. A couple nights ago I lost two
disks from a pool with size=2. :(
I replaced the two failed OSDs and I now have two PGs which are marked as
incomplete in an otherwise healthy cluster. Following this page (
https://ceph.com/community/incomplete-pgs
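The rough shape of the export/import that page describes, with OSD paths and
the pgid as placeholders (stop the OSDs first; on 0.80.x the binary may be
spelled ceph_objectstore_tool):

    # on the old disk, while its filesystem is still mountable
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --op export --pgid 2.1f --file /root/2.1f.export

    # on the OSD that should hold the PG now
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
        --journal-path /var/lib/ceph/osd/ceph-34/journal \
        --op import --file /root/2.1f.export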
On Oct 28, 2014, at 5:20 PM, Lincoln Bryant wrote:
> Hi Greg, Loic,
>
> I think we have seen this as well (sent a mail to the list a week or so ago
> about incomplete pgs). I ended up giving up on the data and doing a
> force_create_pgs after doing a find on my OSDs and deleting the relevant pg
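For anyone else landing here: that last-resort path boils down to the command
below, and it discards whatever data the PG held (the pgid is a placeholder):

    ceph pg force_create_pg 2.1f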
I have a number of PGs which are marked as incomplete. I'm at a loss for how to
go about recovering these PGs and believe they're suffering from the "lost
time" symptom. How do I recover these PGs? I'd settle for sacrificing the "lost
time" and just going with what I've got. I've lost the abilit
On Oct 22, 2014, at 8:22 PM, Craig Lewis wrote:
> Shot in the dark: try manually deep-scrubbing the PG. You could also try
> marking various osd's OUT, in an attempt to get the acting set to include
> osd.25 again, then do the deep-scrub again. That probably won't help though,
> because the pg
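In concrete terms that suggestion is roughly (pgid and osd ids are placeholders;
osd.25 is the one from this thread):

    ceph pg deep-scrub 2.1f
    ceph pg map 2.1f      # see the current up/acting sets
    ceph osd out 7        # out one of the acting OSDs so CRUSH may pull osd.25 back in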
On Oct 22, 2014, at 7:51 PM, Craig Lewis wrote:
> On Wed, Oct 22, 2014 at 3:09 PM, Chris Kitzmiller wrote:
>> On Oct 22, 2014, at 1:50 PM, Craig Lewis wrote:
>>> Incomplete means "Ceph detects that a placement group is missing a
>>> necessary period of history from its log.
> http://ceph.com/docs/master/man/8/crushtool/, just to make sure everything is
> kosher.
I checked with `crushtool --test -i crush.bin --show-bad-mappings` which showed
me errors for mappings above 6 replicas (which I'd expect with my particular
map) but nothing else. Modifying max_size
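The same check can also be narrowed to a single rule and replica count (the
rule number and size below are placeholders):

    crushtool -i crush.bin --test --show-bad-mappings --rule 0 --num-rep 2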
I've gotten myself into the position of having ~100 incomplete PGs. All of my
OSDs are up+in (and I've restarted them all one by one).
I was in the process of rebalancing after altering my CRUSH map when I lost an
OSD backing disk. I replaced that OSD and it seemed to be backfilling well.
Durin
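While it's in that state, these two list exactly which PGs are incomplete and
where they currently map:

    ceph health detail | grep incomplete
    ceph pg dump_stuck inactive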
On Aug 4, 2014, at 10:53 PM, Christian Balzer wrote:
> On Mon, 4 Aug 2014 15:11:39 -0400 Chris Kitzmiller wrote:
>> On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
>>> On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
>>>> I have 3 nodes
On Aug 5, 2014, at 12:43 PM, Mark Nelson wrote:
> On 08/05/2014 08:42 AM, Mariusz Gronczewski wrote:
>> On Mon, 04 Aug 2014 15:32:50 -0500, Mark Nelson wrote:
>>> On 08/04/2014 03:28 PM, Chris Kitzmiller wrote:
>>>> On Aug 1, 2014, at 1:31 PM, Mariusz
On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote:
> I got weird stalling during writes: sometimes I get the same write speed
> for a few minutes and after some time it starts stalling at 0 MB/s for
> minutes
I'm getting very similar behavior on my cluster. My writes start well but then
just kind
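When it stalls, the OSD admin sockets show what the slow requests are waiting
on (the osd id is a placeholder):

    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops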
On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
> On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
>
>> I have 3 nodes each running a MON and 30 OSDs.
>
> Given the HW you list below, that might be a tall order, particularly
> CPU-wise in certain situations.
I have 3 nodes each running a MON and 30 OSDs. When I test my cluster with
either rados bench or with fio via a 10GbE client using RBD I get great initial
speeds >900MBps and I max out my 10GbE links for a while. Then something goes
wrong: performance falters and the cluster stops responding.
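For anyone reproducing this, a typical rados bench invocation looks like (pool
name and thread count are placeholders):

    rados -p testpool bench 60 write -t 32 --no-cleanup
    rados -p testpool bench 60 seq -t 32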
I found this article very interesting:
http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte
I've got Samsung 840 Pros and while I'm thinking that I wouldn't go with them
again I am interested in the fact that (in this anecdotal experiment) it seemed
>> We use /dev/disk/by-path for this reason, but we confirmed that is stable
>> for our HBAs. Maybe /dev/disk/by-something is consistent with your
>> controller.
>
> The upstart/udev scripts will handle mounting and osd id detection, at
> least on Ubuntu.
I'll caution that while the OSD will be c
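For anyone checking whether their controller gives stable by-path names,
something like:

    ls -l /dev/disk/by-path/            # stable controller/slot paths -> current sdX names
    mount | grep /var/lib/ceph/osd      # which device each OSD directory is mounted from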
I've got a 3 node cluster where ceph osd perf reports reasonable
fs_apply_latency for 2 out of 3 of my nodes (~30ms). But on the third node I've
got latencies averaging 15000+ms for all OSDs.
Running Ceph 0.72.2 on Ubuntu 13.10. Each node has 30 HDDs with 6 SSDs for
journals. iperf reports full bandwidth.
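Two checks that usually narrow this down (run the second on the slow node):

    ceph osd perf      # compare fs_commit/fs_apply latency across all OSDs
    iostat -x 1 5      # per-disk await and %util; a saturated journal SSD stands out quickly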