[ceph-users] Re: Combining erasure coding and replication?

2020-03-27 Thread Eugen Block

Hi Brett,

Our concern with Ceph is the cost of having three replicas. Storage  
may be cheap but I’d rather not buy ANOTHER 5pb for a third replica  
if there are ways to do this more efficiently. Site-level redundancy  
is important to us so we can’t simply create an erasure-coded volume  
across two buildings – if we lose power to a building, the entire  
array would become unavailable.


can you elaborate on that? Why is EC not an option? We have installed  
several clusters with two datacenters resilient to losing a whole dc  
(and additional disks if required). So it's basically the choice of  
the right EC profile. Or did I misunderstand something?



Zitat von Brett Randall :


Hi all

Had a fun time trying to join this list, hopefully you don’t get  
this message 3 times!


On to Ceph… We are looking at setting up our first ever Ceph cluster  
to replace Gluster as our media asset storage and production system.  
The Ceph cluster will have 5pb of usable storage. Whether we use it  
as object-storage, or put CephFS in front of it, is still TBD.


Obviously we’re keen to protect this data well. Our current Gluster  
setup utilises RAID-6 on each of the nodes and then we have a single  
replica of each brick. The Gluster bricks are split between  
buildings so that the replica is guaranteed to be in another  
premises. By doing it this way, we guarantee that we can have a  
decent number of disk or node failures (even an entire building)  
before we lose both connectivity and data.


Our concern with Ceph is the cost of having three replicas. Storage  
may be cheap but I’d rather not buy ANOTHER 5pb for a third replica  
if there are ways to do this more efficiently. Site-level redundancy  
is important to us so we can’t simply create an erasure-coded volume  
across two buildings – if we lose power to a building, the entire  
array would become unavailable. Likewise, we can’t simply have a  
single replica – our fault tolerance would drop way down on what it  
is right now.


Is there a way to use both erasure coding AND replication at the  
same time in Ceph to mimic the architecture we currently have in  
Gluster? I know we COULD just create RAID6 volumes on each node and  
use the entire volume as a single OSD, but that this is not the  
recommended way to use Ceph. So is there some other way?


Apologies if this is a nonsensical question, I’m still trying to  
wrap my head around Ceph, CRUSH maps, placement rules, volume types,  
etc etc!


TIA

Brett

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to migrate ceph-xattribs?

2020-03-27 Thread Frank Schilder
> > If automatic migration is not possible, is there at least an efficient way 
> > to
> > *find* everything with special ceph attributes?
> 
> IIRC, you can still see all these attributes by querying for the
> "ceph" xattr. Does that not work for you?

In case I misunderstand this part of your message, something like this

getfattr -d -m "ceph.*" dir

does not work any more, the xattribs are no longer discoverable. In addition, 
quota settings were never discoverable. The earlier thread 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/3HWU4DITVDF4IXDC2NETWS5E3EA4PM6Q/
 is about this.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Gregory Farnum 
Sent: 26 March 2020 18:36
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] How to migrate ceph-xattribs?

On Thu, Mar 26, 2020 at 9:24 AM Frank Schilder  wrote:
>
> De all,
>
> we are in the process of migrating a ceph file system from a 2-pool layout 
> (rep meta+ec data) to the recently recommended 3-pool layout (rep meta, per 
> primary data, ec data). As part of this, we need to migrate any ceph xattribs 
> set on files and directories. As these are no longer discoverable, how would 
> one go about this?
>
> Special cases:
>
> How to migrate quota settings?
> How to migrate dir- and file-layouts?
>
> Ideally, at least quota attributes should be transferable on the fly with 
> tools like rsync.

These are all policy decisions that seem pretty cluster-specific to
me. We hid them in the first place because exposing ceph xattrs to
rsync was breaking it horribly.

> If automatic migration is not possible, is there at least an efficient way to 
> *find* everything with special ceph attributes?

IIRC, you can still see all these attributes by querying for the
"ceph" xattr. Does that not work for you?

>
> Thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Combining erasure coding and replication?

2020-03-27 Thread Frank Schilder
Dear Eugen,

I guess what you are suggesting is something like k+m with m>=k+2, for example 
k=4, m=6. Then, one can distribute 5 shards per DC and sustain the loss of an 
entire DC while still having full access to redundant storage.

Now, a long time ago I was in a lecture about error-correcting codes 
(Reed-Solomon codes). From what I remember, the computational complexity of 
these codes explodes at least exponentially with m. Out of curiosity, how does 
m>3 perform in practice? What's the CPU requirement per OSD?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 27 March 2020 08:33:45
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Combining erasure coding and replication?

Hi Brett,

> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable.

can you elaborate on that? Why is EC not an option? We have installed
several clusters with two datacenters resilient to losing a whole dc
(and additional disks if required). So it's basically the choice of
the right EC profile. Or did I misunderstand something?


Zitat von Brett Randall :

> Hi all
>
> Had a fun time trying to join this list, hopefully you don’t get
> this message 3 times!
>
> On to Ceph… We are looking at setting up our first ever Ceph cluster
> to replace Gluster as our media asset storage and production system.
> The Ceph cluster will have 5pb of usable storage. Whether we use it
> as object-storage, or put CephFS in front of it, is still TBD.
>
> Obviously we’re keen to protect this data well. Our current Gluster
> setup utilises RAID-6 on each of the nodes and then we have a single
> replica of each brick. The Gluster bricks are split between
> buildings so that the replica is guaranteed to be in another
> premises. By doing it this way, we guarantee that we can have a
> decent number of disk or node failures (even an entire building)
> before we lose both connectivity and data.
>
> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable. Likewise, we can’t simply have a
> single replica – our fault tolerance would drop way down on what it
> is right now.
>
> Is there a way to use both erasure coding AND replication at the
> same time in Ceph to mimic the architecture we currently have in
> Gluster? I know we COULD just create RAID6 volumes on each node and
> use the entire volume as a single OSD, but that this is not the
> recommended way to use Ceph. So is there some other way?
>
> Apologies if this is a nonsensical question, I’m still trying to
> wrap my head around Ceph, CRUSH maps, placement rules, volume types,
> etc etc!
>
> TIA
>
> Brett
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Janek Bevendorff
Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.


On 26/03/2020 18:10, Paul Choi wrote:
> I won't speculate more into the MDS's stability, but I do wonder about
> the same thing.
> There is one file served by the MDS that would cause the ceph-fuse
> client to hang. It was a file that many people in the company relied
> on for data updates, so very noticeable. The only fix was to fail over
> the MDS.
>
> Since the free disk space dropped, I haven't heard anyone complain...
> 
>
> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
>  > wrote:
>
> If there is actually a connection, then it's no wonder our MDS
> kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>
>
> On 26/03/2020 17:32, Paul Choi wrote:
>> I can't quite explain what happened, but the Prometheus endpoint
>> became stable after the free disk space for the largest pool went
>> substantially lower than 1PB.
>> I wonder if there's some metric that exceeds the maximum size for
>> some int, double, etc?
>>
>> -Paul
>>
>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>> > > wrote:
>>
>> I haven't seen any MGR hangs so far since I disabled the
>> prometheus
>> module. It seems like the module is not only slow, but kills
>> the whole
>> MGR when the cluster is sufficiently large, so these two
>> issues are most
>> likely connected. The issue has become much, much worse with
>> 14.2.8.
>>
>>
>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> > I am running the very latest version of Nautilus. I will
>> try setting up
>> > an external exporter today and see if that fixes anything.
>> Our cluster
>> > is somewhat large-ish with 1248 OSDs, so I expect stat
>> collection to
>> > take "some" time, but it definitely shouldn't crush the
>> MGRs all the time.
>> >
>> > On 21/03/2020 02:33, Paul Choi wrote:
>> >> Hi Janek,
>> >>
>> >> What version of Ceph are you using?
>> >> We also have a much smaller cluster running Nautilus, with
>> no MDS. No
>> >> Prometheus issues there.
>> >> I won't speculate further than this but perhaps Nautilus
>> doesn't have
>> >> the same issue as Mimic?
>> >>
>> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>> >> > 
>> >> > >> wrote:
>> >>
>> >>     I think this is related to my previous post to this
>> list about MGRs
>> >>     failing regularly and being overall quite slow to
>> respond. The problem
>> >>     has existed before, but the new version has made it
>> way worse. My MGRs
>> >>     keep dyring every few hours and need to be restarted.
>> the Promtheus
>> >>     plugin works, but it's pretty slow and so is the
>> dashboard.
>> >>     Unfortunately, nobody seems to have a solution for
>> this and I
>> >>     wonder why
>> >>     not more people are complaining about this problem.
>> >>
>> >>
>> >>     On 20/03/2020 19:30, Paul Choi wrote:
>> >>     > If I "curl http://localhost:9283/metrics"; and wait
>> sufficiently long
>> >>     > enough, I get this - says "No MON connection". But
>> the mons are
>> >>     health and
>> >>     > the cluster is functioning fine.
>> >>     > That said, the mons' rocksdb sizes are fairly big
>> because
>> >>     there's lots of
>> >>     > rebalancing going on. The Prometheus endpoint
>> hanging seems to
>> >>     happen
>> >>     > regardless of the mon size anyhow.
>> >>     >
>> >>     >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn
>> (15 GiB)
>> >>     >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn
>> (15 GiB)
>> >>     >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn
>> (15 GiB)
>> >>     >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn
>> (15 GiB)
>> >>     >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn
>> (15 GiB)
>> >>     >
>> >>     > # fg
>> >>     > curl -H "Connection: close"
>> http://localhost:9283/metrics
>> >>     > > >>     > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> >>     >
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";>
>> >>     > 
>> >>     > 
>> >>     >     
>> >>     >    

[ceph-users] Re: Combining erasure coding and replication?

2020-03-27 Thread Eugen Block

Hi,

I guess what you are suggesting is something like k+m with m>=k+2,  
for example k=4, m=6. Then, one can distribute 5 shards per DC and  
sustain the loss of an entire DC while still having full access to  
redundant storage.


that's exactly what I mean, yes.

Now, a long time ago I was in a lecture about error-correcting codes  
(Reed-Solomon codes). From what I remember, the computational  
complexity of these codes explodes at least exponentially with m.  
Out of curiosity, how does m>3 perform in practice? What's the CPU  
requirement per OSD?


Such a setup usually would be considered for archiving purposes so the  
performance requirements aren't very high, but so far we haven't heard  
any complaints performance-wise.

I don't have details on CPU requirements at hand right now.

Regards,
Eugen


Zitat von Frank Schilder :


Dear Eugen,

I guess what you are suggesting is something like k+m with m>=k+2,  
for example k=4, m=6. Then, one can distribute 5 shards per DC and  
sustain the loss of an entire DC while still having full access to  
redundant storage.


Now, a long time ago I was in a lecture about error-correcting codes  
(Reed-Solomon codes). From what I remember, the computational  
complexity of these codes explodes at least exponentially with m.  
Out of curiosity, how does m>3 perform in practice? What's the CPU  
requirement per OSD?


Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 27 March 2020 08:33:45
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Combining erasure coding and replication?

Hi Brett,


Our concern with Ceph is the cost of having three replicas. Storage
may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
if there are ways to do this more efficiently. Site-level redundancy
is important to us so we can’t simply create an erasure-coded volume
across two buildings – if we lose power to a building, the entire
array would become unavailable.


can you elaborate on that? Why is EC not an option? We have installed
several clusters with two datacenters resilient to losing a whole dc
(and additional disks if required). So it's basically the choice of
the right EC profile. Or did I misunderstand something?


Zitat von Brett Randall :


Hi all

Had a fun time trying to join this list, hopefully you don’t get
this message 3 times!

On to Ceph… We are looking at setting up our first ever Ceph cluster
to replace Gluster as our media asset storage and production system.
The Ceph cluster will have 5pb of usable storage. Whether we use it
as object-storage, or put CephFS in front of it, is still TBD.

Obviously we’re keen to protect this data well. Our current Gluster
setup utilises RAID-6 on each of the nodes and then we have a single
replica of each brick. The Gluster bricks are split between
buildings so that the replica is guaranteed to be in another
premises. By doing it this way, we guarantee that we can have a
decent number of disk or node failures (even an entire building)
before we lose both connectivity and data.

Our concern with Ceph is the cost of having three replicas. Storage
may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
if there are ways to do this more efficiently. Site-level redundancy
is important to us so we can’t simply create an erasure-coded volume
across two buildings – if we lose power to a building, the entire
array would become unavailable. Likewise, we can’t simply have a
single replica – our fault tolerance would drop way down on what it
is right now.

Is there a way to use both erasure coding AND replication at the
same time in Ceph to mimic the architecture we currently have in
Gluster? I know we COULD just create RAID6 volumes on each node and
use the entire volume as a single OSD, but that this is not the
recommended way to use Ceph. So is there some other way?

Apologies if this is a nonsensical question, I’m still trying to
wrap my head around Ceph, CRUSH maps, placement rules, volume types,
etc etc!

TIA

Brett

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Combining erasure coding and replication?

2020-03-27 Thread Lars Täuber
Hi Brett,

I'm far from being an expert, but you may consider rbd-mirroring between 
EC-pools.

Cheers,
Lars

Am Fri, 27 Mar 2020 06:28:02 +
schrieb Brett Randall :

> Hi all
> 
> Had a fun time trying to join this list, hopefully you don’t get this message 
> 3 times!
> 
> On to Ceph… We are looking at setting up our first ever Ceph cluster to 
> replace Gluster as our media asset storage and production system. The Ceph 
> cluster will have 5pb of usable storage. Whether we use it as object-storage, 
> or put CephFS in front of it, is still TBD.
> 
> Obviously we’re keen to protect this data well. Our current Gluster setup 
> utilises RAID-6 on each of the nodes and then we have a single replica of 
> each brick. The Gluster bricks are split between buildings so that the 
> replica is guaranteed to be in another premises. By doing it this way, we 
> guarantee that we can have a decent number of disk or node failures (even an 
> entire building) before we lose both connectivity and data.
> 
> Our concern with Ceph is the cost of having three replicas. Storage may be 
> cheap but I’d rather not buy ANOTHER 5pb for a third replica if there are 
> ways to do this more efficiently. Site-level redundancy is important to us so 
> we can’t simply create an erasure-coded volume across two buildings – if we 
> lose power to a building, the entire array would become unavailable. 
> Likewise, we can’t simply have a single replica – our fault tolerance would 
> drop way down on what it is right now.
> 
> Is there a way to use both erasure coding AND replication at the same time in 
> Ceph to mimic the architecture we currently have in Gluster? I know we COULD 
> just create RAID6 volumes on each node and use the entire volume as a single 
> OSD, but that this is not the recommended way to use Ceph. So is there some 
> other way?
> 
> Apologies if this is a nonsensical question, I’m still trying to wrap my head 
> around Ceph, CRUSH maps, placement rules, volume types, etc etc!
> 
> TIA
> 
> Brett
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] BlueStore and checksum

2020-03-27 Thread Priya Sehgal
Hi,
I am trying to find out whether Ceph has a method to detect silent
corruption such as bit rot. I came across this text in a book - "Mastering
Ceph : Infrastructure Storage Solution with the latest Ceph release" by
Nick Fisk -


Luminous release of Ceph employs ZFS-like ability to checksum data at every
read. BlueStore calculates and stores the crc32 checksum of any data that
is written. On each read request, BlueStore reads the checksum and compares
it with the data read from the device. If there is a mismatch, BlueStore
will report read error and repair damage. Ceph will then retry the read
from another OSD holding the object.

I have two questions:
1. If there is a mismatch will the ceph user (who initiated GET) receive
any error message and he has to retry OR will this error be auto-corrected
by Ceph and the data would be returned from the other OSD (user will never
know what went wrong underneath?
2. Does this feature work for Multi-part upload objects also?

-- 
Regards,
Priya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Combining erasure coding and replication?

2020-03-27 Thread Simon Oosthoek
On 27/03/2020 09:56, Eugen Block wrote:
> Hi,
> 
>> I guess what you are suggesting is something like k+m with m>=k+2, for
>> example k=4, m=6. Then, one can distribute 5 shards per DC and sustain
>> the loss of an entire DC while still having full access to redundant
>> storage.
> 
> that's exactly what I mean, yes.

We have an EC pool of 5+7, which works that way. Currently we have no
demand for it, but it should do the job.

Cheers

/Simon

> 
>> Now, a long time ago I was in a lecture about error-correcting codes
>> (Reed-Solomon codes). From what I remember, the computational
>> complexity of these codes explodes at least exponentially with m. Out
>> of curiosity, how does m>3 perform in practice? What's the CPU
>> requirement per OSD?
> 
> Such a setup usually would be considered for archiving purposes so the
> performance requirements aren't very high, but so far we haven't heard
> any complaints performance-wise.
> I don't have details on CPU requirements at hand right now.
> 
> Regards,
> Eugen
> 
> 
> Zitat von Frank Schilder :
> 
>> Dear Eugen,
>>
>> I guess what you are suggesting is something like k+m with m>=k+2, for
>> example k=4, m=6. Then, one can distribute 5 shards per DC and sustain
>> the loss of an entire DC while still having full access to redundant
>> storage.
>>
>> Now, a long time ago I was in a lecture about error-correcting codes
>> (Reed-Solomon codes). From what I remember, the computational
>> complexity of these codes explodes at least exponentially with m. Out
>> of curiosity, how does m>3 perform in practice? What's the CPU
>> requirement per OSD?
>>
>> Best regards,
>>
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Eugen Block 
>> Sent: 27 March 2020 08:33:45
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: Combining erasure coding and replication?
>>
>> Hi Brett,
>>
>>> Our concern with Ceph is the cost of having three replicas. Storage
>>> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
>>> if there are ways to do this more efficiently. Site-level redundancy
>>> is important to us so we can’t simply create an erasure-coded volume
>>> across two buildings – if we lose power to a building, the entire
>>> array would become unavailable.
>>
>> can you elaborate on that? Why is EC not an option? We have installed
>> several clusters with two datacenters resilient to losing a whole dc
>> (and additional disks if required). So it's basically the choice of
>> the right EC profile. Or did I misunderstand something?
>>
>>
>> Zitat von Brett Randall :
>>
>>> Hi all
>>>
>>> Had a fun time trying to join this list, hopefully you don’t get
>>> this message 3 times!
>>>
>>> On to Ceph… We are looking at setting up our first ever Ceph cluster
>>> to replace Gluster as our media asset storage and production system.
>>> The Ceph cluster will have 5pb of usable storage. Whether we use it
>>> as object-storage, or put CephFS in front of it, is still TBD.
>>>
>>> Obviously we’re keen to protect this data well. Our current Gluster
>>> setup utilises RAID-6 on each of the nodes and then we have a single
>>> replica of each brick. The Gluster bricks are split between
>>> buildings so that the replica is guaranteed to be in another
>>> premises. By doing it this way, we guarantee that we can have a
>>> decent number of disk or node failures (even an entire building)
>>> before we lose both connectivity and data.
>>>
>>> Our concern with Ceph is the cost of having three replicas. Storage
>>> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
>>> if there are ways to do this more efficiently. Site-level redundancy
>>> is important to us so we can’t simply create an erasure-coded volume
>>> across two buildings – if we lose power to a building, the entire
>>> array would become unavailable. Likewise, we can’t simply have a
>>> single replica – our fault tolerance would drop way down on what it
>>> is right now.
>>>
>>> Is there a way to use both erasure coding AND replication at the
>>> same time in Ceph to mimic the architecture we currently have in
>>> Gluster? I know we COULD just create RAID6 volumes on each node and
>>> use the entire volume as a single OSD, but that this is not the
>>> recommended way to use Ceph. So is there some other way?
>>>
>>> Apologies if this is a nonsensical question, I’m still trying to
>>> wrap my head around Ceph, CRUSH maps, placement rules, volume types,
>>> etc etc!
>>>
>>> TIA
>>>
>>> Brett
>>>
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To u

[ceph-users] [ceph][nautilus] error initalizing secondary zone

2020-03-27 Thread Ignazio Cassano
Hello , I am trying to initializing the secondary zone pulling the realm
define in the primary zone:

radosgw-admin realm pull --rgw-realm=nivola --url=http://10.102.184.190:8080
--access-key=access --secret=secret

The following errors appears:

request failed: (16) Device or resource busy

Could you help me please ?

Ignazio
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ceph][nautilus] error initalizing secondary zone

2020-03-27 Thread Ignazio Cassano
I am sorry.
The problem was the http_proxy
Ignazio

Il giorno ven 27 mar 2020 alle ore 11:24 Ignazio Cassano <
ignaziocass...@gmail.com> ha scritto:

> Hello , I am trying to initializing the secondary zone pulling the realm
> define in the primary zone:
>
> radosgw-admin realm pull --rgw-realm=nivola --url=
> http://10.102.184.190:8080 --access-key=access --secret=secret
>
> The following errors appears:
>
> request failed: (16) Device or resource busy
>
> Could you help me please ?
>
> Ignazio
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] samba ceph-vfs and scrubbing interval

2020-03-27 Thread Marco Savoca
Hi all,

i‘m running a 3 node ceph cluster setup with collocated mons and mds for 
actually 3 filesystems at home since mimic. I’m planning to downgrade to one FS 
and use RBD in the future, but this is another story. I’m using the cluster as 
cold storage on spindles with EC-pools for archive purposes. The cluster 
usually does not run 24/7. I actually managed to upgrade to octopus without 
problems yesterday. So first of all: great job with the release. 

Now I have a little problem and a general question to address.

I have tried to share the CephFS via samba and the ceph-vfs module but I could 
not manage to get write access (read access is not a problem) to the share 
(even with the admin key). When I share the mounted path (kernel module or 
fuser mount) instead as usual there are no problems at all.  Is ceph-vfs 
generally read only and I missed this point? Furthermore I suppose, that there 
is no possibility to choose between the different mds namespaces, right?

Now the general question. Since the cluster does not run 24/7 as stated and is 
turned on perhaps once a week for a couple of hours on demand, what are 
reasonable settings for the scrubbing intervals? As I said, the storage is cold 
and there is mostly read i/o. The archiving process adds approximately 0.5 % of 
new data of the cluster’s total storage capacity. 

Stay healthy and regards,

Marco Savoca



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Move on cephfs not O(1)?

2020-03-27 Thread Jeff Layton
On Thu, 2020-03-26 at 10:32 -0700, Gregory Farnum wrote:
> On Thu, Mar 26, 2020 at 9:13 AM Frank Schilder  wrote:
> > Dear all,
> > 
> > yes, this is it, quotas. In the structure A/B/ there was a quota set on A. 
> > Hence, B was moved out of this zone and this does indeed change mv to be a 
> > cp+rm.
> > 
> > The obvious follow-up. What is the procedure for properly moving data as an 
> > administrator? Do I really need to unset quotas, do the move and set quotas 
> > back again?
> 
> Ah-ha, this is a difference between the userspace and kernel clients.
> :( The kernel client always returns EXDEV if crossing "quota realms"
> (different quota roots). I'm not sure why it behaves that way as
> userspace is different:
> * If fhere is a quota in the target directory, and
> * If the target directory's existing data, plus the file(s) being
> moved, exceed the target directory's quota,
> then userspace returns EXDEV/EDQUOT. This seems like the right kernel
> behavior as well...
> 
> Jeff, is that a known issue for some reason? Should we make a new bug? :)
> 

(cc'ing Luis since he wrote cafe21a4fb3075, which added this check)

In ceph_rename, we have this, which is probably what you're hitting:

/* don't allow cross-quota renames */   
if ((old_dir != new_dir) && 
(!ceph_quota_is_same_realm(old_dir, new_dir)))  
return -EXDEV;  

That does seem to just check whether the realms are the same and doesn't
actually check the space in them.

I haven't studied this in detail, but it may be hard to ensure that we
won't exceed the quota in a race-free way. How do you ensure that
another thread doesn't do something that would exceed the quota just as
you issue the rename request?

This is fairly trivial in the userland client since it does everything
under the BCM (Big Client Mutex), but won't be in the kernel client.

Opening a bug for this won't hurt, but it may not be simple to
implement.
-- 
Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: samba ceph-vfs and scrubbing interval

2020-03-27 Thread Jeff Layton
On Fri, 2020-03-27 at 12:00 +0100, Marco Savoca wrote:
> Hi all,
> 
> i‘m running a 3 node ceph cluster setup with collocated mons and mds
> for actually 3 filesystems at home since mimic. I’m planning to
> downgrade to one FS and use RBD in the future, but this is another
> story. I’m using the cluster as cold storage on spindles with EC-pools 
> for archive purposes. The cluster usually does not run 24/7. I
> actually managed to upgrade to octopus without problems yesterday. So
> first of all: great job with the release. 
> 
> Now I have a little problem and a general question to address.
> 
> I have tried to share the CephFS via samba and the ceph-vfs module but
> I could not manage to get write access (read access is not a problem)
> to the share (even with the admin key). When I share the mounted path
> (kernel module or fuser mount) instead as usual there are no problems
> at all.  Is ceph-vfs generally read only and I missed this point? 

No. I haven't tested it in some time, but it does allow clients to
write. When you say you can't get write access, what are you doing to
test this, and what error are you getting back?

> Furthermore I suppose, that there is no possibility to choose between
> the different mds namespaces, right?
> 

Yeah, doesn't look like anyone has added that. That would probably be
pretty easy to add, though it would take a little while to trickle out
to the distros.

> Now the general question. Since the cluster does not run 24/7 as
> stated and is turned on perhaps once a week for a couple of hours on
> demand, what are reasonable settings for the scrubbing intervals? As I
> said, the storage is cold and there is mostly read i/o. The archiving
> process adds approximately 0.5 % of new data of the cluster’s total
> storage capacity. 

-- 
Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] fast luminous -> nautilus -> octopus upgrade could lead to assertion failure on OSD

2020-03-27 Thread kefu chai
hi folks,

if you are upgrading from luminous to octopus, or you plan to do so,
please read on.

in octopus, osd will crash if it processes an osdmap whose
require_osd_release flag is still luminous.

this only happens if a cluster upgrades very quickly from luminous to
nautilus and to octopus. in this case, there are good chances that an
octopus OSD will need to consume osdmaps which were created back in
luminous. because we assumed that ceph did not ugprade across major
releases, in octopus, OSD will panic at seeing osdmap from luminous.
this is a known bug[0], and already fixed in master. and the next
octopus release will include the fix to address this issue. as a
workaround, you need to wait a while after running

ceph osd require-osd-release nautilus

and optionally inject lots of osdmaps into cluster to ensure that the
old luminous osd maps are trimmed:

for i in `seq 500`; do
  ceph osd blacklist add 192.168.0.1
  ceph osd blacklist rm 192.168.0.1
done

after the whole cluster are active+clean, then upgrade to octopus.

happy upgrading!


cheers,

--
[0] https://tracker.ceph.com/issues/44759

-- 
Regards
Kefu Chai
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help: corrupt pg

2020-03-27 Thread Jake Grimmett

Hi Greg,

Yes, this was caused by a chain of event. As a cautionary tale, the main 
ones were:



1) minor nautilus release upgrade, followed by a rolling node restart 
script that mistakenly relied on "ceph -s" for cluster health info,


i.e. it didn't wait for the cluster to return to health before moving 
onto restarting the next node, which caused...


2) several pgs to have "unfound object", attempted fix was to " 
mark_unfound_lost delete", which caused


3) primary osd (of the pg) to crash with: PrimaryLogPG.cc: 11550: FAILED 
ceph_assert(head_obc)


4) only fix we found for the primary pg crashing, was to destroy the 
primary osd, and wait for the pg to recover. This worked, or so we thought,


5) however looking at logs, one pg was in 
"active+recovery_unfound+undersized+degraded+remapped", I think I may 
have "mark_unfound_lost delete" this pg


6) after destroying the primary osd for this pg (oops!) the pg then went 
"inactive"


7) to restore access we set "ceph osd pool set ec82pool min_size 8" (on 
an EC 8+2 pool), at which point


8) the new primary OSD crashed with  FAILED 
ceph_assert(clone_size.count(clone)) leaving us with a pg in a very bad 
state...



I will see if we can buy some consulting time, the alternative is 
several weeks of rsync.


Many thanks again for your advice, it's very much appreciated,

Jake

On 26/03/2020 17:21, Gregory Farnum wrote:

On Wed, Mar 25, 2020 at 5:19 AM Jake Grimmett  wrote:

Dear All,

We are "in a bit of a pickle"...

No reply to my message (23/03/2020),  subject  "OSD: FAILED
ceph_assert(clone_size.count(clone))"

So I'm presuming it's not possible to recover the crashed OSD

>From your later email it sounds like this is part of a chain of events
that included telling the OSDs to deliberately work around some
known-old-or-wrong state, so while I wouldn't say it's impossible to
fix it's probably not simple and may require buying some consulting
from one of the teams that do that. I don't think I've seen these
errors before anywhere, certainly.


This is bad news, as one pg may be lost, (we are using EC 8+2, pg dump
shows [NONE,NONE,NONE,388,125,25,427,226,77,154] )

Without this pg we have 1.8PB of broken cephfs.

I could rebuild the cluster from scratch, but this means no user backups
for a couple of weeks.

The cluster has 10 nodes, uses an EC 8:2 pool for cephfs data
(replicated NVMe metdata pool) and is running Nautilus 14.2.8

Clearly, it would be nicer if we could fix the OSD, but if this isn't
possible, can someone confirm that the right procedure to recover from a
corrupt pg is:

1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is "rm"
fine?
4) destroy the bad pg
# ceph osd  force-create-pg 5.750
5) Copy the missing files back with rsync or similar...

This sounds about right.
Keep in mind that the PG will just be a random fraction of the
default-4MB objects in CephFS, so you will hit a huge proportion of
large files. As I assume this is a data pool, keep in mind that any
large files will just show up as a 4MB hole where the missing object
is — this may be preferable to reverting the whole file, or let you
only copy in the missing data if they are logs, or whatever. And you
can use the CephFS metadata to determine if the file in backup is
up-to-date or not.
-Greg


a better "recipe" or other advice gratefully received,

best regards,
Jake




Note: I am working from home until further notice.

For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Note: I am working from home until further notice.
For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueStore and checksum

2020-03-27 Thread Nathan Fish
The error is silently corrected and the correct data rewritten to the
bad sector. There may be a slight latency increase on the read.
The checksumming is implemented at the Bluestore layer, what you are
storing makes no difference.

On Fri, Mar 27, 2020 at 5:12 AM Priya Sehgal  wrote:
>
> Hi,
> I am trying to find out whether Ceph has a method to detect silent
> corruption such as bit rot. I came across this text in a book - "Mastering
> Ceph : Infrastructure Storage Solution with the latest Ceph release" by
> Nick Fisk -
>
>
> Luminous release of Ceph employs ZFS-like ability to checksum data at every
> read. BlueStore calculates and stores the crc32 checksum of any data that
> is written. On each read request, BlueStore reads the checksum and compares
> it with the data read from the device. If there is a mismatch, BlueStore
> will report read error and repair damage. Ceph will then retry the read
> from another OSD holding the object.
>
> I have two questions:
> 1. If there is a mismatch will the ceph user (who initiated GET) receive
> any error message and he has to retry OR will this error be auto-corrected
> by Ceph and the data would be returned from the other OSD (user will never
> know what went wrong underneath?
> 2. Does this feature work for Multi-part upload objects also?
>
> --
> Regards,
> Priya
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v15.2.0 Octopus released

2020-03-27 Thread Sage Weil
One word of caution: there is one known upgrade issue if you

 - upgrade from luminous to nautilus, and then
 - run nautilus for a very short period of time (hours), and then
 - upgrade from nautilus to octopus

that prevents OSDs from starting.  We have a fix that will be in 15.2.1, 
but until that is out, I would recommend against the double-upgrade.  If 
you have been running nautilus for a while (days) you should be fine.

sage


https://tracker.ceph.com/issues/44770
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Move on cephfs not O(1)?

2020-03-27 Thread Jeff Layton
On Fri, 2020-03-27 at 07:36 -0400, Jeff Layton wrote:
> On Thu, 2020-03-26 at 10:32 -0700, Gregory Farnum wrote:
> > On Thu, Mar 26, 2020 at 9:13 AM Frank Schilder  wrote:
> > > Dear all,
> > > 
> > > yes, this is it, quotas. In the structure A/B/ there was a quota set on 
> > > A. Hence, B was moved out of this zone and this does indeed change mv to 
> > > be a cp+rm.
> > > 
> > > The obvious follow-up. What is the procedure for properly moving data as 
> > > an administrator? Do I really need to unset quotas, do the move and set 
> > > quotas back again?
> > 
> > Ah-ha, this is a difference between the userspace and kernel clients.
> > :( The kernel client always returns EXDEV if crossing "quota realms"
> > (different quota roots). I'm not sure why it behaves that way as
> > userspace is different:
> > * If fhere is a quota in the target directory, and
> > * If the target directory's existing data, plus the file(s) being
> > moved, exceed the target directory's quota,
> > then userspace returns EXDEV/EDQUOT. This seems like the right kernel
> > behavior as well...
> > 
> > Jeff, is that a known issue for some reason? Should we make a new bug? :)
> > 
> 
> (cc'ing Luis since he wrote cafe21a4fb3075, which added this check)
> 
> In ceph_rename, we have this, which is probably what you're hitting:
> 
> /* don't allow cross-quota renames */ 
>   
> if ((old_dir != new_dir) &&   
>   
> (!ceph_quota_is_same_realm(old_dir, new_dir)))
>   
> return -EXDEV;
>   
> 
> That does seem to just check whether the realms are the same and doesn't
> actually check the space in them.
> 
> I haven't studied this in detail, but it may be hard to ensure that we
> won't exceed the quota in a race-free way. How do you ensure that
> another thread doesn't do something that would exceed the quota just as
> you issue the rename request?
> 
> This is fairly trivial in the userland client since it does everything
> under the BCM (Big Client Mutex), but won't be in the kernel client.
> 
> Opening a bug for this won't hurt, but it may not be simple to
> implement.

Ok, I gave it a harder look and it doesn't look too bad, given that
quotas are sort of "sloppy" anyway. We could just do a check that we
won't exceed quota on the destination directory. The catch is that there
may be other details to deal with too in moving an existing inode under
a different snaprealm.

You're welcome to open a tracker bug for this, but I'm not sure who will
take on that project and when.

Thanks, 
-- 
Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread shubjero
I've reported stability problems with ceph-mgr w/ prometheus plugin
enabled on all versions we ran in production which were several
versions of Luminous and Mimic. Our solution was to disable the
prometheus exporter. I am using Zabbix instead. Our cluster is 1404
OSD's in size with about 9PB raw with around 35% utilization.

On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
 wrote:
>
> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
> failing constantly due to the prometheus module doing something funny.
>
>
> On 26/03/2020 18:10, Paul Choi wrote:
> > I won't speculate more into the MDS's stability, but I do wonder about
> > the same thing.
> > There is one file served by the MDS that would cause the ceph-fuse
> > client to hang. It was a file that many people in the company relied
> > on for data updates, so very noticeable. The only fix was to fail over
> > the MDS.
> >
> > Since the free disk space dropped, I haven't heard anyone complain...
> > 
> >
> > On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
> >  > > wrote:
> >
> > If there is actually a connection, then it's no wonder our MDS
> > kept crashing. Our Ceph has 9.2PiB of available space at the moment.
> >
> >
> > On 26/03/2020 17:32, Paul Choi wrote:
> >> I can't quite explain what happened, but the Prometheus endpoint
> >> became stable after the free disk space for the largest pool went
> >> substantially lower than 1PB.
> >> I wonder if there's some metric that exceeds the maximum size for
> >> some int, double, etc?
> >>
> >> -Paul
> >>
> >> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> >>  >> > wrote:
> >>
> >> I haven't seen any MGR hangs so far since I disabled the
> >> prometheus
> >> module. It seems like the module is not only slow, but kills
> >> the whole
> >> MGR when the cluster is sufficiently large, so these two
> >> issues are most
> >> likely connected. The issue has become much, much worse with
> >> 14.2.8.
> >>
> >>
> >> On 23/03/2020 09:00, Janek Bevendorff wrote:
> >> > I am running the very latest version of Nautilus. I will
> >> try setting up
> >> > an external exporter today and see if that fixes anything.
> >> Our cluster
> >> > is somewhat large-ish with 1248 OSDs, so I expect stat
> >> collection to
> >> > take "some" time, but it definitely shouldn't crush the
> >> MGRs all the time.
> >> >
> >> > On 21/03/2020 02:33, Paul Choi wrote:
> >> >> Hi Janek,
> >> >>
> >> >> What version of Ceph are you using?
> >> >> We also have a much smaller cluster running Nautilus, with
> >> no MDS. No
> >> >> Prometheus issues there.
> >> >> I won't speculate further than this but perhaps Nautilus
> >> doesn't have
> >> >> the same issue as Mimic?
> >> >>
> >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> >>  >> 
> >> >>  >> >> wrote:
> >> >>
> >> >> I think this is related to my previous post to this
> >> list about MGRs
> >> >> failing regularly and being overall quite slow to
> >> respond. The problem
> >> >> has existed before, but the new version has made it
> >> way worse. My MGRs
> >> >> keep dyring every few hours and need to be restarted.
> >> the Promtheus
> >> >> plugin works, but it's pretty slow and so is the
> >> dashboard.
> >> >> Unfortunately, nobody seems to have a solution for
> >> this and I
> >> >> wonder why
> >> >> not more people are complaining about this problem.
> >> >>
> >> >>
> >> >> On 20/03/2020 19:30, Paul Choi wrote:
> >> >> > If I "curl http://localhost:9283/metrics"; and wait
> >> sufficiently long
> >> >> > enough, I get this - says "No MON connection". But
> >> the mons are
> >> >> health and
> >> >> > the cluster is functioning fine.
> >> >> > That said, the mons' rocksdb sizes are fairly big
> >> because
> >> >> there's lots of
> >> >> > rebalancing going on. The Prometheus endpoint
> >> hanging seems to
> >> >> happen
> >> >> > regardless of the mon size anyhow.
> >> >> >
> >> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn
> >> (15 GiB)
> >> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn
> >> (15 GiB)
> >> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Jarett DeAngelis
I’m actually very curious how well this is performing for you as I’ve 
definitely not seen a deployment this large. How do you use it?

> On Mar 27, 2020, at 11:47 AM, shubjero  wrote:
> 
> I've reported stability problems with ceph-mgr w/ prometheus plugin
> enabled on all versions we ran in production which were several
> versions of Luminous and Mimic. Our solution was to disable the
> prometheus exporter. I am using Zabbix instead. Our cluster is 1404
> OSD's in size with about 9PB raw with around 35% utilization.
> 
> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
>  wrote:
>> 
>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
>> failing constantly due to the prometheus module doing something funny.
>> 
>> 
>> On 26/03/2020 18:10, Paul Choi wrote:
>>> I won't speculate more into the MDS's stability, but I do wonder about
>>> the same thing.
>>> There is one file served by the MDS that would cause the ceph-fuse
>>> client to hang. It was a file that many people in the company relied
>>> on for data updates, so very noticeable. The only fix was to fail over
>>> the MDS.
>>> 
>>> Since the free disk space dropped, I haven't heard anyone complain...
>>> 
>>> 
>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
>>> >> > wrote:
>>> 
>>>If there is actually a connection, then it's no wonder our MDS
>>>kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>>> 
>>> 
>>>On 26/03/2020 17:32, Paul Choi wrote:
I can't quite explain what happened, but the Prometheus endpoint
became stable after the free disk space for the largest pool went
substantially lower than 1PB.
I wonder if there's some metric that exceeds the maximum size for
some int, double, etc?
 
-Paul
 
On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>>>> wrote:
 
I haven't seen any MGR hangs so far since I disabled the
prometheus
module. It seems like the module is not only slow, but kills
the whole
MGR when the cluster is sufficiently large, so these two
issues are most
likely connected. The issue has become much, much worse with
14.2.8.
 
 
On 23/03/2020 09:00, Janek Bevendorff wrote:
> I am running the very latest version of Nautilus. I will
try setting up
> an external exporter today and see if that fixes anything.
Our cluster
> is somewhat large-ish with 1248 OSDs, so I expect stat
collection to
> take "some" time, but it definitely shouldn't crush the
MGRs all the time.
> 
> On 21/03/2020 02:33, Paul Choi wrote:
>> Hi Janek,
>> 
>> What version of Ceph are you using?
>> We also have a much smaller cluster running Nautilus, with
no MDS. No
>> Prometheus issues there.
>> I won't speculate further than this but perhaps Nautilus
doesn't have
>> the same issue as Mimic?
>> 
>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>> >>>
>> >> wrote:
>> 
>>I think this is related to my previous post to this
list about MGRs
>>failing regularly and being overall quite slow to
respond. The problem
>>has existed before, but the new version has made it
way worse. My MGRs
>>keep dyring every few hours and need to be restarted.
the Promtheus
>>plugin works, but it's pretty slow and so is the
dashboard.
>>Unfortunately, nobody seems to have a solution for
this and I
>>wonder why
>>not more people are complaining about this problem.
>> 
>> 
>>On 20/03/2020 19:30, Paul Choi wrote:
>>> If I "curl http://localhost:9283/metrics"; and wait
sufficiently long
>>> enough, I get this - says "No MON connection". But
the mons are
>>health and
>>> the cluster is functioning fine.
>>> That said, the mons' rocksdb sizes are fairly big
because
>>there's lots of
>>> rebalancing going on. The Prometheus endpoint
hanging seems to
>>happen
>>> regardless of the mon size anyhow.
>>> 
>>>mon.woodenbox0 is 41 GiB >= mon_data_size_warn
(15 GiB)
>>>mon.woodenbox2 is 26 GiB >= mon_data_size_warn
(15 GiB)
>>>mon.woodenbox4 is 42 GiB >= mon_data_size_warn
(15 GiB)
>>>mon.woodenbox3 is 43 GiB >= mon_data_size_warn
(15 GiB)
>>>mon.woodenbox1 is 38 GiB >= mon_data_size_warn
(15 GiB)
>>> 
>>> # fg
>>> curl -H "Connection: close"

[ceph-users] Re: Identify slow ops

2020-03-27 Thread Thomas Schneider
Hi,

I have upgraded to 14.2.8 and rebooted all nodes sequentially including
all 3 MON services.
However the slow ops are still displayed with increasing block time.

If this is a (harmless) bug anymore, then I would like to understand
what's causing it.
But if this is not a bug, then I would like to know how to start analyse
the root cause, means which operations are affected, which OSDs
contribute to slow IO, etc.

Please advise how to proceed.
# ceph -s
  cluster:
    id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
    17 daemons have recently crashed
    2263 slow ops, oldest one blocked for 183885 sec, mon.ld5505
has slow ops

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 2d)
    mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
    mds: cephfs:2 {0=ld4257=up:active,1=ld5508=up:active} 2
up:standby-replay 3 up:standby
    osd: 442 osds: 441 up (since 38h), 441 in (since 38h)

  data:
    pools:   7 pools, 19628 pgs
    objects: 68.65M objects, 262 TiB
    usage:   786 TiB used, 744 TiB / 1.5 PiB avail
    pgs: 19628 active+clean

  io:
    client:   3.3 KiB/s rd, 3.1 MiB/s wr, 7 op/s rd, 25 op/s wr


Please advise how to proceed.

THX


Am 17.02.2020 um 18:31 schrieb Paul Emmerich:
> that's probably just https://tracker.ceph.com/issues/43893
> (a harmless bug)
>
> Restart the mons to get rid of the message
>
> Paul
>
> -- Paul Emmerich Looking for help with your Ceph cluster? Contact us
> at https://croit.io croit GmbH Freseniusstr. 31h 81247 München
> www.croit.io Tel: +49 89 1896585 90 On Mon, Feb 17, 2020 at 2:59 PM
> Thomas Schneider <74cmo...@gmail.com> wrote:
>> Hi,
>>
>> the current output of ceph -s reports a warning:
>> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>> This time is increasing.
>>
>> root@ld3955:~# ceph -s
>>   cluster:
>> id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>> health: HEALTH_WARN
>> 9 daemons have recently crashed
>> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
>> has slow ops
>>
>>   services:
>> mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
>> mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
>> mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
>> up:standby-replay 3 up:standby
>> osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
>>
>>   data:
>> pools:   7 pools, 19628 pgs
>> objects: 65.78M objects, 251 TiB
>> usage:   753 TiB used, 779 TiB / 1.5 PiB avail
>> pgs: 19628 active+clean
>>
>>   io:
>> client:   427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
>>
>> The details are as follows:
>> root@ld3955:~# ceph health detail
>> HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
>> blocked for 347755 sec, mon.ld5505 has slow ops
>> RECENT_CRASH 9 daemons have recently crashed
>> mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
>> mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
>> mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
>> mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
>> mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
>> mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
>> mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
>> mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
>> mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
>> SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
>> slow ops
>>
>> There's no error on services (mgr, mon, osd).
>>
>> Can you please advise how to identify the root cause of this slow ops?
>>
>> THX
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Space leak in Bluestore

2020-03-27 Thread vitalif
Update on my issue. It seems it was caused by the broken compression 
which one of 14.2.x releases (ubuntu builds) probably had.


My osd versions were mixed. Five OSDs were 14.2.7, one was 14.2.4, other 
6 were 14.2.8.


I moved the same pg several times more. Space usage dropped when the pg 
was moved from 11,6,0 to 6,0,7. Then it raised again after moving the pg 
back.


Then I upgraded all OSDs and moved the PG again. Space usage dropped to 
the original point...


So now I probably have to move all PGs twice. Because I don't know which 
ones are affected...



Hi,

The cluster is all-flash (NVMe), so the removal is fast and it's in
fact pretty noticeable, even on Prometheus graphs.

Also I've logged raw space usage from `ceph -f json df`:

1) before pg rebalance started the space usage was 32724002664448 bytes
2) just before the rebalance finished it was 32883513622528 bytes
(1920 of ~120k objects misplaced) = +100 GB
3) then it started to drop (not instantly, but fast) and stopped at
32785906380800 = +58 GB over the original

I've repeated it several times. The behaviour is always the same.
First it copies the PG, then removes the old copy, but space usage
doesn't drop to the original point. It's obviously not client IO either;
it always happens exactly during the rebalance.


Hi Vitaliy,

just as a guess to verify:

a while ago I observed a very long removal of a (pretty large) pool.
It took several days to complete. The DB was on a spinner, which was
one of the drivers of this slow behavior.

Another factor is the PG removal design, which enumerates up to 30
entries max to fill a single removal batch, then executes it,
everything in a single "thread". So the process is pretty slow for
millions of objects...

During the removal, the pool's (read: PGs') space stayed in use and
decreased slowly. Pretty high DB volume utilization was observed.

I assume rebalance performs PG removal as well - maybe that's the
case?

Thanks,

Igor
On 3/26/2020 1:51 AM, Виталий Филиппов wrote:


Hi Igor,

I think so because
1) space usage increases after each rebalance. Even when the same pg
is moved twice (!)
2) I use 4k min_alloc_size from the beginning

One crazy hypothesis is that maybe ceph allocates space for
uncompressed objects, then compresses them and leaks the
(uncompressed minus compressed) space. Really crazy idea, but who
knows o_O.

I already did a deep fsck, it didn't help... what else could I
check?...

On 26 March 2020 at 01:40:52 GMT+03:00, Igor Fedotov  wrote:

Bluestore fsck/repair detect and fix leaks at the Bluestore level, but
I doubt your issue is there. To be honest, I don't understand from the
overview why you think that there are any leaks at all...

Not sure whether this is relevant, but from my experience space "leaks"
are sometimes caused by a 64K allocation unit and keeping tons of small
files, or by massive small EC overwrites. To verify whether this
applies, you might want to inspect the bluestore performance counters
(bluestore_stored vs. bluestore_allocated) to estimate your losses due
to high allocation units. A significant difference at multiple OSDs
might indicate that the overhead is caused by high allocation
granularity. Compression might make this analysis not that simple
though...

Thanks,
Igor

On 3/26/2020 1:19 AM, vita...@yourcmc.ru wrote:

I have a question regarding this problem - is it possible to rebuild
bluestore allocation metadata? I could try it to test if it's an
allocator problem...

Hi. I'm experiencing some kind of a space leak in Bluestore. I use EC,
compression and snapshots. First I thought that the leak was caused by
"virtual clones" (issue #38184). However, then I got rid of most of the
snapshots, but continued to experience the problem.

I suspected something when I added a new disk to the cluster and free
space in the cluster didn't increase (!). So to track down the issue I
moved one PG (34.1a) using upmaps from osd11,6,0 to osd6,0,7 and then
back to osd11,6,0. It ate +59 GB after the first move and +51 GB after
the second. As I understand it, this proves that it's not #38184:
devirtualization of virtual clones couldn't eat additional space after
a SECOND rebalance of the same PG.

The PG has ~39000 objects, it is EC 2+1 and compression is enabled.
The compression ratio is about ~2.7 in my setup, so the PG should use
~90 GB raw space.

Before and after moving the PG I stopped osd0, mounted it with
ceph-objectstore-tool with debug bluestore = 20/20 and opened the
34.1a***/all directory. It seems to dump all object extents into the
log in that case. So now I have two logs with all allocated extents
for osd0 (I hope all extents are there). I parsed both logs and added
all compressed blob sizes together ("get_ref Blob ... 0x2 -> 0x...
compressed"). But they add up to ~39 GB before the first rebalance
(34.1as2), ~22 GB after it (34.1as1) and ~41 GB again after the second
move (34.1as2), which doesn't indicate a leak. But the raw space usage
still exceeds the initial value by a lot. So it's clear that there's a
leak somewhere. What additional details can I provide?
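A quick way to do the comparison suggested above (a sketch, assuming
osd.0 is local and its admin socket is reachable):

  # logical data vs. space actually allocated, plus compression counters
  ceph daemon osd.0 perf dump | grep -E 'bluestore_(stored|allocated|compressed)'

A bluestore_allocated that is much larger than bluestore_stored on many
OSDs points at allocation-unit overhead rather than a genuine leak; with
compression enabled, compare bluestore_compressed_allocated against
bluestore_compressed_original as well.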

[ceph-users] Re: v15.2.0 Octopus released

2020-03-27 Thread Mazzystr
What about the missing dependencies for octopus on el8? (looking at you,
ceph-mgr!)

On Fri, Mar 27, 2020 at 7:15 AM Sage Weil  wrote:

> One word of caution: there is one known upgrade issue if you
>
>  - upgrade from luminous to nautilus, and then
>  - run nautilus for a very short period of time (hours), and then
>  - upgrade from nautilus to octopus
>
> that prevents OSDs from starting.  We have a fix that will be in 15.2.1,
> but until that is out, I would recommend against the double-upgrade.  If
> you have been running nautilus for a while (days) you should be fine.
>
> sage
>
>
> https://tracker.ceph.com/issues/44770
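A quick sanity check before attempting the second hop (a sketch; it only
confirms the current state, not how long nautilus has been running):

  # every daemon should already report a 14.2.x (nautilus) version
  ceph versions

  # and the first upgrade should have bumped the release flag
  ceph osd dump | grep require_osd_release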
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Leave of absence...

2020-03-27 Thread Sage Weil
Hi everyone,

I am taking time off from the Ceph project and from Red Hat, starting in 
April and extending through the US election in November. I will initially 
be working with an organization focused on voter registration and turnout 
and combating voter suppression and disinformation campaigns.

During this time I will maintain some involvement in the Ceph community, 
primarily around strategic planning for Pacific and the Ceph Foundation, 
but most of my time will be focused elsewhere. 

Most decision making around Ceph will remain in the capable hands of the 
Ceph Leadership Team and component leads--I have the utmost confidence in 
their judgement and abilities.  Yehuda Sadeh and Josh Durgin will be 
filling in to provide high-level guidance where needed.

I’ll be participating in the Pacific planning meetings planned for next 
week, which will be important in kicking off development for Pacific: 

https://ceph.io/cds/ceph-developer-summit-pacific/

I am extremely proud of what we have accomplished with the Octopus 
release, and I believe the Ceph community will continue to do great things 
with Pacific!  I look forward to returning at the end of the year to help 
wrap up the release and (hopefully) get things ready for Cephalocon next 
March.

Most of all, I am excited to become engaged in another effort that I feel 
strongly about--one that will have a very real impact on my kids’ 
futures--and that will be easier to explain to lay people! :)

Thanks!
sage
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph WAL/DB disks - do they really only use 3GB, or 30GB, or 300GB

2020-03-27 Thread victorhooi
Hi,

I'm using Intel Optane disks to provide WAL/DB capacity for my Ceph cluster 
(which is part of Proxmox - for VM hosting).

I've read that WAL/DB partitions only use either 3GB, or 30GB, or 300GB - due 
to the way that RocksDB works.

Is this true?

My current partition for WAL/DB is 145 GB - does this mean that 115 GB of
that will be permanently wasted?

Is this behaviour documented somewhere, or is there some background, so I can 
understand a bit more about how it works?
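For what it's worth, a way to see how much of a DB partition an OSD is
actually using, and whether RocksDB has spilled over onto the slow
device (a sketch, assuming osd.0 and access to its admin socket):

  ceph daemon osd.0 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'
  # db_used_bytes well below db_total_bytes means the partition is not
  # filled; a non-zero slow_used_bytes means metadata has spilled onto
  # the main (slow) device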

Thanks,
Victor
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph influxDB support versus Telegraf Ceph plugin?

2020-03-27 Thread victorhooi
Hi,

I've read that Ceph has some InfluxDB reporting capabilities inbuilt 
(https://docs.ceph.com/docs/master/mgr/influx/).

However, Telegraf, which is the system reporting daemon for InfluxDB, also has 
a Ceph plugin 
(https://github.com/influxdata/telegraf/tree/master/plugins/inputs/ceph).

Just curious what people's thoughts on the two are, or what they are using in 
production?

Which is easier to deploy/maintain, have you found? Or more useful for 
alerting, or tracking performance gremlins?
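In case it helps the comparison, enabling the built-in module looks
roughly like this (a sketch; the option names follow the mgr/influx
documentation and may differ between releases, and the hostname and
credentials are placeholders):

  ceph mgr module enable influx
  ceph config set mgr mgr/influx/hostname influxdb.example.com
  ceph config set mgr mgr/influx/database ceph
  ceph config set mgr mgr/influx/username ceph-stats
  ceph config set mgr mgr/influx/password secret
  # errors reaching InfluxDB show up in the active mgr's log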

Thanks,
Victor
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Move on cephfs not O(1)?

2020-03-27 Thread Gregory Farnum
Made a ticket in the Linux Kernel Client tracker:
https://tracker.ceph.com/issues/44791

I naively don't think this should be very complicated at all, except
that I recall hearing something about locking issues with quota realms
in the kclient? But the userspace client definitely doesn't have to do
anything tricky with moving things, it's pretty much a normal move
with some glue to point at a different quota realm.
-Greg
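For anyone hitting the kernel client's EXDEV in the meantime, the
workaround already discussed in the thread (lift the quota, do the move,
put the quota back) looks roughly like this; the mount point, paths and
the 10 TiB limit are placeholders only:

  # note the current limit on the quota root
  getfattr -n ceph.quota.max_bytes /mnt/cephfs/A

  # a value of 0 removes the quota for the duration of the move
  setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/A
  mv /mnt/cephfs/A/B /mnt/cephfs/C/

  # restore the previous limit afterwards (10 TiB here)
  setfattr -n ceph.quota.max_bytes -v 10995116277760 /mnt/cephfs/A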

On Fri, Mar 27, 2020 at 7:34 AM Jeff Layton  wrote:
>
> On Fri, 2020-03-27 at 07:36 -0400, Jeff Layton wrote:
> > On Thu, 2020-03-26 at 10:32 -0700, Gregory Farnum wrote:
> > > On Thu, Mar 26, 2020 at 9:13 AM Frank Schilder  wrote:
> > > > Dear all,
> > > >
> > > > yes, this is it, quotas. In the structure A/B/ there was a quota set on 
> > > > A. Hence, B was moved out of this zone and this does indeed change mv 
> > > > to be a cp+rm.
> > > >
> > > > The obvious follow-up. What is the procedure for properly moving data 
> > > > as an administrator? Do I really need to unset quotas, do the move and 
> > > > set quotas back again?
> > >
> > > Ah-ha, this is a difference between the userspace and kernel clients.
> > > :( The kernel client always returns EXDEV if crossing "quota realms"
> > > (different quota roots). I'm not sure why it behaves that way as
> > > userspace is different:
> > > * If there is a quota in the target directory, and
> > > * If the target directory's existing data, plus the file(s) being
> > > moved, exceed the target directory's quota,
> > > then userspace returns EXDEV/EDQUOT. This seems like the right kernel
> > > behavior as well...
> > >
> > > Jeff, is that a known issue for some reason? Should we make a new bug? :)
> > >
> >
> > (cc'ing Luis since he wrote cafe21a4fb3075, which added this check)
> >
> > In ceph_rename, we have this, which is probably what you're hitting:
> >
> >         /* don't allow cross-quota renames */
> >         if ((old_dir != new_dir) &&
> >             (!ceph_quota_is_same_realm(old_dir, new_dir)))
> >                 return -EXDEV;
> >
> > That does seem to just check whether the realms are the same and doesn't
> > actually check the space in them.
> >
> > I haven't studied this in detail, but it may be hard to ensure that we
> > won't exceed the quota in a race-free way. How do you ensure that
> > another thread doesn't do something that would exceed the quota just as
> > you issue the rename request?
> >
> > This is fairly trivial in the userland client since it does everything
> > under the BCM (Big Client Mutex), but won't be in the kernel client.
> >
> > Opening a bug for this won't hurt, but it may not be simple to
> > implement.
>
> Ok, I gave it a harder look and it doesn't look too bad, given that
> quotas are sort of "sloppy" anyway. We could just do a check that we
> won't exceed quota on the destination directory. The catch is that there
> may be other details to deal with too in moving an existing inode under
> a different snaprealm.
>
> You're welcome to open a tracker bug for this, but I'm not sure who will
> take on that project and when.
>
> Thanks,
> --
> Jeff Layton 
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Move on cephfs not O(1)?

2020-03-27 Thread Frank Schilder
Thanks a lot! Have a good weekend.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Gregory Farnum 
Sent: 27 March 2020 21:18:37
To: Jeff Layton
Cc: Frank Schilder; Zheng Yan; ceph-users; Luis Henriques
Subject: Re: [ceph-users] Re: Move on cephfs not O(1)?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io