> On 9 February 2017 at 4:09, Sage Weil wrote:
>
>
> Hello, ceph operators...
>
> Several times in the past we've had to do some on-disk format conversion
> during upgrade, which meant that the first time the ceph-osd daemon started
> after upgrade it had to spend a few minutes fixing up its
The only issue I can think of is if there isn't a version of the clients
that's fully tested to work with a partially upgraded cluster, or a documented
incompatibility requiring downtime. We've had upgrades where we had to
upgrade the clients first and others where we had to upgrade them last due to
issues wi
On Wed, Feb 8, 2017 at 5:34 PM, Marcus Furlong wrote:
> On 9 February 2017 at 09:34, Trey Palmer wrote:
> > The multisite configuration available starting in Jewel sounds more
> > appropriate for your situation.
> >
> > But then you need two separate clusters, each large enough to contain
> all o
Hello, ceph operators...
Several times in the past we've had to do some on-disk format conversion
during upgrade, which meant that the first time the ceph-osd daemon started
after upgrade it had to spend a few minutes fixing up its on-disk files.
We haven't had to recently, though, and generally
On 9 February 2017 at 09:34, Trey Palmer wrote:
> The multisite configuration available starting in Jewel sounds more
> appropriate for your situation.
>
> But then you need two separate clusters, each large enough to contain all of
> your objects.
On that note, is anyone aware of documentation th
+1
Ever since upgrading to 10.2.x I have been seeing a lot of issues with our Ceph
cluster: OSDs going down, and OSD servers running out of memory and
killing all ceph-osd processes. Again, 10.2.5 on a 4.4.x kernel.
It seems that with every release there are more and more problems with
The multisite configuration available starting in Jewel sounds more
appropriate for your situation.
But then you need two separate clusters, each large enough to contain all
of your objects.
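For anyone looking for a starting point, the master-side setup in Jewel boils down to roughly the commands below. This is only a sketch, not a tested procedure; the realm/zonegroup/zone names and endpoints are placeholders:
  radosgw-admin realm create --rgw-realm=movies --default
  radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --master --default
  radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east --endpoints=http://rgw1:80 --master --default
  radosgw-admin period update --commit
The second cluster then pulls the realm with radosgw-admin realm pull, creates its own (non-master) zone, commits the period again, and restarts its gateways.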
-- Trey
On Tue, Feb 7, 2017 at 12:17 PM, Daniel Picolli Biazus
wrote:
> Hi Guys,
>
> I have been plan
On Wed, Feb 8, 2017 at 9:13 PM, Tracy Reed wrote:
> On Wed, Feb 08, 2017 at 10:57:38AM PST, Shinobu Kinjo spake thusly:
>> If you would be able to reproduce the issue intentionally under
>> particular condition which I have no idea about at the moment, it
>> would be helpful.
>
> The issue is very
Playing around with MDS with a hot standby on Kraken. When I fail out the
active MDS manually it switches correctly to the standby, i.e. ceph mds fail
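For anyone following along, the commands involved look roughly like this (a sketch; the rank and daemon id are placeholders):
  ceph mds stat                                        # which MDS is active, which is standby
  ceph mds fail 0                                      # mark rank 0 failed so the standby takes over
  ceph daemon mds.<id> config get mds_beacon_grace     # how long an unresponsive MDS is tolerated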
Noticed that when I have two MDS servers and I shut down the active MDS
server, it takes 5 minutes for the standby replay to become active (seems it's
We have alerting on our mons to notify us when the memory usage is above 80%
and we go around and restart the mon services in that cluster. It is a memory
leak somewhere in the code, but the problem is so infrequent it's hard to get
good enough logs to track it down. We restart the mons in a clus
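Something along these lines does the check and restart; a rough sketch only, not our exact tooling (the threshold and the systemd unit name are assumptions, adjust for your deployment):
  # restart the local mon if ceph-mon's resident memory exceeds 80% of RAM
  pct=$(ps -o pmem= -C ceph-mon | awk '{s+=$1} END {print int(s)}')
  if [ "${pct:-0}" -ge 80 ]; then
      systemctl restart ceph-mon@"$(hostname -s)"
  fi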
On Wed, Feb 08, 2017 at 10:57:38AM PST, Shinobu Kinjo spake thusly:
> If you would be able to reproduce the issue intentionally under
> particular condition which I have no idea about at the moment, it
> would be helpful.
The issue is very reproducible. It hangs every time. Any install I do
with
This is the first development checkpoint release of the Luminous series, the
next long-term stable release. We're off to a good start to release
Luminous in the spring of '17.
Major changes from Kraken
-------------------------
* When assigning a network to the public network and not to
the clust
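For context, the settings that note is about live in ceph.conf; a minimal sketch (the subnets are placeholders):
  [global]
      public network  = 10.0.0.0/24
      # optional: OSD replication/heartbeat traffic uses this network when set
      cluster network = 10.0.1.0/24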
I have had two ceph monitor nodes generate swap space alerts this week.
Looking at the memory, I see ceph-mon using a lot of memory and most of the
swap space. My Ceph nodes have 128 GB of memory, with 2 GB of swap (I know the
memory/swap ratio is odd).
When I get the alert, I see the following
root@empi
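A couple of starting points for seeing where that memory sits (assuming the mons are built with tcmalloc; the mon id is a placeholder):
  ceph daemon mon.<id> heap stats      # tcmalloc's view of the heap
  ceph daemon mon.<id> heap release    # ask tcmalloc to hand freed pages back to the OS
  smem -P ceph-mon                     # per-process RSS and swap, if smem is installed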
Hi,
High-concurrency backfilling or flushing a cache tier triggers it
fairly reliably.
Setting backfills to >16 and switching from hammer to jewel tunables
(which moves most of the data) will trigger this, as will going in the
opposite direction.
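For concreteness, the knobs involved look like this; values are only examples, and switching tunables will shuffle a large fraction of the data:
  ceph tell osd.* injectargs '--osd-max-backfills 16'   # raise backfill concurrency at runtime
  ceph osd crush tunables jewel                         # or 'hammer' to go the other way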
The nodes where we observed this most commonly ar
On Wed, Feb 8, 2017 at 8:07 PM, Dan van der Ster wrote:
> Hi,
>
> This is interesting. Do you have a bit more info about how to identify
> a server which is suffering from this problem? Is there some process
> (xfs* or kswapd?) we'll see as busy in top or iotop?
That's my question as well. If you
If you would be able to reproduce the issue intentionally under
particular condition which I have no idea about at the moment, it
would be helpful.
There were some ML threads previously regarding a *similar* issue.
# google "libvirt rbd issue"
Regards,
On Tue, Feb 7, 2017 at 7:50 PM, Tracy Reed wr
Hey Greg,
Thanks for your quick responses. I have to leave the office now but I'll look
deeper into it tomorrow to try and understand what's the cause of this. I'll
try to find other peerings between these two hosts and check those OSDs' logs
for potential anomalies. I'll also have a look at an
On Wed, Feb 8, 2017 at 10:25 AM, wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided — ie, the message got corrupted. You
>> shouldn't try and fix that by playing around with the peering settings
>> as it's not a peering bug
Hi Greg,
> Yes, "bad crc" indicates that the checksums on an incoming message did
> not match what was provided — ie, the message got corrupted. You
> shouldn't try and fix that by playing around with the peering settings
> as it's not a peering bug.
> Unless there's a bug in the messaging layer c
Hi,
We've encountered this on both 4.4 and 4.8. It might have been there
earlier, but we have no data for that anymore.
While we haven't established causation, there's a high correlation with kernel
page allocation failures. If you see "[timestamp] : page
allocation failure: order:5,
mode:0x2082120(GFP_ATOMI
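For anyone checking their own nodes, this is roughly how to look for it, plus one commonly suggested (but not verified here) mitigation; the sysctl value is only an example:
  dmesg -T | grep -i 'page allocation failure'
  # keep more memory free for atomic/high-order allocations; tune for your RAM
  sysctl -w vm.min_free_kbytes=2097152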
On Wed, Feb 8, 2017 at 8:17 AM, wrote:
> Hi Ceph folks,
>
> I have a cluster running Jewel 10.2.5 using a mix of EC and replicated pools.
>
> After rebooting a host last night, one PG refuses to complete peering
>
> pg 1.323 is stuck inactive for 73352.498493, current state peering, last
> acting [
Hi Corentin,
I've tried that; the primary hangs when trying to injectargs, so I set the
option in the config file and restarted all OSDs in the PG. It came up with:
pg 1.323 is remapped+peering, acting
[595,1391,2147483647,127,937,362,267,320,7,634,716]
Still can't query the PG, no error messag
Hello,
I have run into this case before; I applied the parameter
(osd_find_best_info_ignore_history_les) to all the OSDs that had reported
blocked queries.
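For reference, setting it via ceph.conf and restarting the affected OSDs looks roughly like this (a sketch; the OSD id is one from the acting set mentioned above, and the option is usually removed again once the PG has peered):
  [osd]
      osd find best info ignore history les = true
  systemctl restart ceph-osd@595       # repeat for the other OSDs in the PG's acting set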
--
Regards,
CEO FEELB | Corentin BONNETON
cont...@feelb.io
> On 8 Feb 2017 at 17:17, george.vasilaka...@stfc.ac.uk wrote:
>
> Hi Ceph
Hi Ceph folks,
I have a cluster running Jewel 10.2.5 using a mix of EC and replicated pools.
After rebooting a host last night, one PG refuses to complete peering
pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting
[595,1391,240,127,937,362,267,320,7,634,716]
Restartin
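The usual first steps for a PG stuck in peering, for anyone hitting the same thing (nothing cluster-specific here):
  ceph health detail | grep 1.323      # confirm the PG's reported state
  ceph pg 1.323 query                  # peering detail; may hang if the primary is stuck
  ceph pg map 1.323                    # up/acting OSD sets for the PG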
I'm unlikely to get back to this in a hurry with a 100% confirmation
it works (by end-to-end testing from the client perspective), but
where I got to so far looked promising, so I thought I'd share. Note
that this was done on a Hammer cluster. Notes/expansions on steps
inline:
The assumption here i
Hi Daniel,
50 ms of latency is going to introduce a big performance hit, though
things will still function. We did a few tests which are documented
at http://www.osris.org/performance/latency
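If you want to approximate that latency in a test setup, netem can inject it; a minimal sketch (the interface name is a placeholder):
  tc qdisc add dev eth0 root netem delay 50ms   # add 50 ms of delay on this interface
  tc qdisc del dev eth0 root netem              # remove it again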
thanks,
Ben
On Tue, Feb 7, 2017 at 12:17 PM, Daniel Picolli Biazus
wrote:
> Hi Guys,
>
> I have been pl
Alright, I just redeployed the Ubuntu box again. Here is what you requested
(server machine - ubcephnode, client machine - ubpayload):
ahmed@ubcephnode:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
ahmed@ubcephnode:~$ lsb_release -a
No LSB modules are available.
Distributor
Hi,
I would also be interested in whether there is a way to determine if this is happening.
I'm not sure if it's related, but when I updated a
number of OSD nodes to kernel 4.7 from 4.4, I started seeing lots of random
alerts from OSDs saying that other OSDs were not
responding. The load wasn't particu
Hi,
This is interesting. Do you have a bit more info about how to identify
a server which is suffering from this problem? Is there some process
(xfs* or kswapd?) we'll see as busy in top or iotop?
Also, which kernel are you using?
Cheers, Dan
On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig wr