[ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Hendrik Peyerl
Hello All,

I am having some troubles with Ceph Quotas not working on subdirectories. I am 
running with the following directory tree:

- customer
  - project
- environment
  - application1
  - application2
  - applicationx

I set a quota on "environment", which works perfectly fine: the client sees the
quota and is not breaching it. The problem starts when I try to mount a
subdirectory like application1; that directory does not have any quota at all.
Is there a way to set a quota on "environment" so that the application
directories will not be able to go over it?
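
For completeness, this is roughly how the quota gets set and how the
subdirectory gets mounted; the mount point, client name, monitor address and
byte value below are placeholders:

setfattr -n ceph.quota.max_bytes -v <bytes> /mnt/cephfs/customer/project/environment
mount -t ceph mon1:6789:/customer/project/environment/application1 /mnt/app1 -o name=appclient,secretfile=/etc/ceph/appclient.secret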

Client Caps:

caps: [mds] allow rw path=/customer/project/environment
caps: [mon] allow r
caps: [osd] allow rw tag cephfs data=cephfs


My Environment:

Ceph 13.2.4 on CentOS 7.6 with Kernel 4.20.3-1 for both Servers and Clients


Any help would be greatly appreciated.

Best Regards,

Hendrik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Right way to delete OSD from cluster?

2019-02-26 Thread Fyodor Ustinov
Hi!

Thank you so much!

I do not understand why, but your variant really does cause only one rebalance
compared to the "osd out" approach.

- Original Message -
From: "Scottix" 
To: "Fyodor Ustinov" 
Cc: "ceph-users" 
Sent: Wednesday, 30 January, 2019 20:31:32
Subject: Re: [ceph-users] Right way to delete OSD from cluster?

I have generally gone the crush reweight 0 route.
This way the drive can participate in the rebalance, and the rebalance
only happens once. Then you can take it out and purge.

If I am not mistaken this is the safest.

ceph osd crush reweight osd.<id> 0
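
For clarity, the whole sequence is then roughly as follows (the osd id is a
placeholder, and treat this as a sketch rather than a checked procedure):

ceph osd crush reweight osd.<id> 0          # drain; this triggers the only rebalance
# wait until "ceph -s" shows all PGs active+clean again
ceph osd out <id>
systemctl stop ceph-osd@<id>                # on the OSD host
ceph osd purge <id> --yes-i-really-mean-it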

On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> But won't I get undersized objects after "ceph osd crush remove"? That is,
> isn't this the same thing as simply turning off the OSD and waiting for the
> cluster to recover?
>
> - Original Message -
> From: "Wido den Hollander" 
> To: "Fyodor Ustinov" , "ceph-users" 
> Sent: Wednesday, 30 January, 2019 15:05:35
> Subject: Re: [ceph-users] Right way to delete OSD from cluster?
>
> On 1/30/19 2:00 PM, Fyodor Ustinov wrote:
> > Hi!
> >
> > I thought I should first do "ceph osd out", wait for the relocation of the
> > misplaced objects to finish, and after that do "ceph osd purge".
> > But after the "purge" the cluster starts relocating again.
> >
> > Maybe I'm doing something wrong? Then what is the correct way to delete the 
> > OSD from the cluster?
> >
>
> You are not doing anything wrong, this is the expected behavior. There
> are two CRUSH changes:
>
> - Marking it out
> - Purging it
>
> You could do:
>
> $ ceph osd crush remove osd.X
>
> Wait until everything is healthy again (HEALTH_OK)
>
> $ ceph osd purge X
>
> The last step should then not initiate any data movement.
>
> Wido
>
> > WBR,
> > Fyodor.



-- 
T: @Thaumion
IG: Thaumion
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph migration

2019-02-26 Thread Eugen Block

Hi,


Well, I've just reacted to all the text at the beginning of
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
including the title "the messy way". If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch.


With that I would agree. Careful planning and an installation
following the docs should be the first priority. But I would also
encourage users to experiment with Ceph before going into production.
Dealing with failures and outages on a production cluster causes much
more of a headache than on a test cluster. ;-)


If the cluster is empty anyway, I would also rather reinstall it; it
doesn't take that much time. I just wanted to point out that there is
a way that worked for me, although that was only on a test cluster.


Regards,
Eugen


Quoting Janne Johansson:


On Mon, 25 Feb 2019 at 13:40, Eugen Block wrote:

I just moved a (virtual lab) cluster to a different network, and it worked
like a charm.
Using an offline method you need to:

- Set osd noout and ensure there are no OSDs up
- Change the MONs' IPs; see the bottom of [1], "Changing a Monitor's IP
address" - the MONs are the only daemons that are really
sticky with their IP
- Ensure ceph.conf has the new MON IPs and network IPs
- Start the MONs with the new monmap (rough sketch below), then start the OSDs
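
For the monmap step, a rough sketch of the commands involved (the mon name,
paths and the new IP are placeholders, and the mon daemon has to be stopped
for the extract/inject variant):

ceph mon getmap -o /tmp/monmap              # or, offline: ceph-mon -i mon1 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm mon1 /tmp/monmap
monmaptool --add mon1 192.168.10.11:6789 /tmp/monmap
ceph-mon -i mon1 --inject-monmap /tmp/monmap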

> No, certain IPs will be visible in the databases, and those will
> not change.

I'm not sure where old IPs would still be visible; could you clarify
that, please?


Well, I've just reacted to all the text at the beginning of
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
including the title "the messy way". If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch. What
if you miss some part, some command gives you an error
you really aren't comfortable with, something doesn't really feel
right after doing it, then the whole lifetime of that cluster
will be followed by a small nagging feeling that it might have been
that time you followed a guide that tries to talk you out of
doing it that way, for a cluster with no data.

I think that is the wrong way to learn how to run clusters.

--
May the most significant bit of your life be positive.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Ramana Raja
On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl  wrote: 
> 
> Hello All,
> 
> I am having some troubles with Ceph Quotas not working on subdirectories. I
> am running with the following directory tree:
> 
> - customer
>   - project
> - environment
>   - application1
>   - application2
>   - applicationx
> 
> I set a quota on environment which works perfectly fine, the client sees the
> quota and is not breaching it. The problem starts when I try to mount a
> subdirectory like application1, this directory does not have any quota at
> all.
> Is there a possibility to set a quota for environment so that the application
> directories will not be able to go over that quota?

Can you set quotas on the application directories as well?
setfattr -n ceph.quota.max_bytes -v <bytes> /environment/application1
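
To verify what a particular client actually sees, something like getfattr
should do (the path is a placeholder for wherever the directory is mounted):

getfattr -n ceph.quota.max_bytes /mnt/cephfs/environment/application1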

> 
> Client Caps:
> 
> caps: [mds] allow rw path=/customer/project/environment
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs
> 
> 
> My Environment:
> 
> Ceph 13.2.4 on CentOS 7.6 with Kernel 4.20.3-1 for both Servers and Clients
> 
> 
> Any help would be greatly appreciated.
> 
> Best Regards,
> 
> Hendrik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

2019-02-26 Thread Massimo Sgaravatto
On Mon, Feb 25, 2019 at 9:26 PM mart.v  wrote:

> - As far as I understand the reported 'implicated osds' are only the
> primary ones. In the log of the osds you should find also the relevant pg
> number, and with this information you can get all the involved OSDs. This
> might be useful e.g. to see if a specific OSD node is always involved. This
> was my case (the problem was with the patch cable connecting the node).
>
>
> I can see the implicated OSDs right from the REQUEST_SLOW error log lines and
> therefore I can tell which nodes are involved. It is indeed on all nodes in
> the cluster, no exception. So it cannot be linked to one specific node.
>

I am afraid I was not clear enough. Suppose that ceph health detail reports
a slow request involving osd.14

In osd.14 log I see this line:

2019-02-24 16:58:39.475740 7fe25a84d700  0 log_channel(cluster) log [WRN] :
slow request 30.328572 seconds old, received at 2019-02-24 16:58:09.147037:
osd_op(client.148580771.0:476351313 8.1d6
8:6ba6a916:::rbd_data.ba32e7238e1f29.04b3:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 3776512~4096] snapc 0=[]
ondisk+write+known_if_redirected e1242718) currently op_applied

Here the pg id is 8.1d6.

# ceph pg map 8.1d6
osdmap e1247126 pg 8.1d6 (8.1d6) -> up [14,38,24] acting [14,38,24]

So the problem is not necessarily in osd.14. It could also be in osd.38 or
osd.24, or in the relevant hosts.


>
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
> some of these slow requests (to see which events take much time)
>
>
> 2019-02-25 17:40:49.550303 > initiated
>
> 2019-02-25 17:40:49.550338 > queued_for_pg
>
> 2019-02-25 17:40:49.550924 > reached_pg
>
> 2019-02-25 17:40:49.550950 > started
>
> 2019-02-25 17:40:49.550989 > waiting for subops from 21,35
>
> 2019-02-25 17:40:49.552316 > op_commit
>
> 2019-02-25 17:40:49.552320 > op_applied
>
> 2019-02-25 17:40:49.553216 > sub_op_commit_rec from 21
>
> 2019-02-25 17:41:18.416662 > sub_op_commit_rec from 35
>
> 2019-02-25 17:41:18.416708 > commit_sent
>
> 2019-02-25 17:41:18.416726 > done
>
>
> I'm not sure how to read this output - is the time the start or the finish? Does
> it mean that it is waiting for OSD 21 or 35? I tried to examine a few
> different OSDs with dump_historic_ops; they all seem to wait on other OSDs.
> But there is no similarity (the OSD numbers are different).
>
>
As far as I understand, in this case most of the time was spent waiting for the
answer from osd.35.
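
A possible next step is to look at that OSD directly, for example (osd.35 is
simply the id from this example; the commands have to be run on the host where
that OSD lives):

ceph daemon osd.35 dump_ops_in_flight
ceph daemon osd.35 dump_historic_ops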

PS: You might also want to have a look at the thread "Debugging 'slow
requests'" in this mailing list, where Brad Hubbard (thanks again!) helped
me debug a 'slow request' problem.

Cheers, Massimo



> Best,
>
> Martin
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Files in CephFS data pool

2019-02-26 Thread Hector Martin

On 15/02/2019 22:46, Ragan, Tj (Dr.) wrote:
Is there any way to find out which files are stored in a CephFS data 
pool?  I know you can reference the extended attributes, but those are 
only relevant for files created after ceph.dir.layout.pool 
or ceph.file.layout.pool attributes are set - I need to know about all 
the files in a pool.


As far as I can tell, you *can* read the ceph.file.layout.pool xattr on
any file in CephFS, even those that haven't had it explicitly set.
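
For example (the mount point and path are placeholders):

getfattr -n ceph.file.layout.pool /mnt/cephfs/path/to/file
# or, to scan a whole tree (slow on a large filesystem):
find /mnt/cephfs -type f -exec getfattr -n ceph.file.layout.pool {} +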


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Luis Henriques
On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote:
> On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl  wrote: 
> > 
> > Hello All,
> > 
> > I am having some troubles with Ceph Quotas not working on subdirectories. I
> > am running with the following directory tree:
> > 
> > - customer
> >   - project
> > - environment
> >   - application1
> >   - application2
> >   - applicationx
> > 
> > I set a quota on environment which works perfectly fine, the client sees the
> > quota and is not breaching it. The problem starts when I try to mount a
> > subdirectory like application1, this directory does not have any quota at
> > all.
> > Is there a possibility to set a quota for environment so that the 
> > application
> > directories will not be able to go over that quota?
> 
> Can you set quotas on the application directories as well?
> setfattr -n ceph.quota.max_bytes -v  
> /environment/application1 

Right, that would work of course.  The client needs to have access to
the 'environment' directory inode in order to enforce quotas, otherwise
it won't be aware of the existence of any quotas at all.  See
"Limitations" (#4 in particular) in

 http://docs.ceph.com/docs/master/cephfs/quota/

Cheers,
--
Luís
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Hendrik Peyerl
Thank you Ramana and Luis for your quick reply.

@ Ramana: I have a quota of 300G for this specific environment; I don't want to
split this into 100G quotas for all the subdirectories, as I cannot yet foresee
how big they will be.

@ Luis: The client has access to the environment directory, as you can see from
the client caps I sent as well.

Thanks and best regards,

Hendrik

> On 26. Feb 2019, at 11:11, Luis Henriques  wrote:
> 
> On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote:
>> On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl  
>> wrote: 
>>> 
>>> Hello All,
>>> 
>>> I am having some troubles with Ceph Quotas not working on subdirectories. I
>>> am running with the following directory tree:
>>> 
>>> - customer
>>>  - project
>>>- environment
>>>  - application1
>>>  - application2
>>>  - applicationx
>>> 
>>> I set a quota on environment which works perfectly fine, the client sees the
>>> quota and is not breaching it. The problem starts when I try to mount a
>>> subdirectory like application1, this directory does not have any quota at
>>> all.
>>> Is there a possibility to set a quota for environment so that the 
>>> application
>>> directories will not be able to go over that quota?
>> 
>> Can you set quotas on the application directories as well?
>> setfattr -n ceph.quota.max_bytes -v  
>> /environment/application1 
> 
> Right, that would work of course.  The client needs to have access to
> the 'environment' directory inode in order to enforce quotas, otherwise
> it won't be aware of the existence of any quotas at all.  See
> "Limitations" (#4 in particular) in
> 
> http://docs.ceph.com/docs/master/cephfs/quota/
> 
> Cheers,
> --
> Luís

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin reshard stale-instances rm experience

2019-02-26 Thread Wido den Hollander



On 2/21/19 9:19 PM, Paul Emmerich wrote:
> On Thu, Feb 21, 2019 at 4:05 PM Wido den Hollander  wrote:
>> This isn't available in 13.2.4, but should be in 13.2.5, so on Mimic you
>> will need to wait. But this might bite you at some point.
> 
> Unfortunately it hasn't been backported to Mimic:
> http://tracker.ceph.com/issues/37447
> 

I see. We really need this in Mimic as well. I have another cluster,
which is running Mimic, and it is a suspect as well:

547 buckets, but 290k objects in the index pool. That ratio is not correct.
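
In case anyone wants to reproduce the comparison, this is roughly how the two
numbers can be obtained (the index pool name below is the default one, adjust
it to your zone, and jq has to be installed):

radosgw-admin bucket list | jq length
rados -p default.rgw.buckets.index ls | wc -l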

> This is the Luminous backport:
> https://github.com/ceph/ceph/pull/25326/files which looks a little bit
> messy because it fixes 3 related issues in one backport.
> 
> CC'ing devel: best way to get this in Mimic?
> 

I'd love to know as well.

Wido

> Paul
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Luis Henriques
Hendrik Peyerl  writes:

> Thank you Ramana and Luis for your quick reply.
>
> @ Ramana: I have a quota for 300G for this specific environment, I dont want 
> to
> split this into 100G quotas for all the subdirectories as i cannot yet forsee
> how big they will be.
>
> @ Luis: The Client has access to the Environment directory as you can
> see from the Client Caps I sent aswell.

Hmm.. Ok, I misunderstood your issue.

I've done a quick test and the fuse client seems to be able to handle
this scenario correctly, so I've created a bug in the tracker[1].  I'll
investigate and see if this can be fixed.

[1] https://tracker.ceph.com/issues/38482

Cheers,
-- 
Luis


>
> Thanks and best regards,
>
> Hendrik
>
>> On 26. Feb 2019, at 11:11, Luis Henriques  wrote:
>> 
>> On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote:
>>> On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl  
>>> wrote: 
 
 Hello All,
 
 I am having some troubles with Ceph Quotas not working on subdirectories. I
 am running with the following directory tree:
 
 - customer
  - project
- environment
  - application1
  - application2
  - applicationx
 
 I set a quota on environment which works perfectly fine, the client sees 
 the
 quota and is not breaching it. The problem starts when I try to mount a
 subdirectory like application1, this directory does not have any quota at
 all.
 Is there a possibility to set a quota for environment so that the 
 application
 directories will not be able to go over that quota?
>>> 
>>> Can you set quotas on the application directories as well?
>>> setfattr -n ceph.quota.max_bytes -v  
>>> /environment/application1 
>> 
>> Right, that would work of course.  The client needs to have access to
>> the 'environment' directory inode in order to enforce quotas, otherwise
>> it won't be aware of the existence of any quotas at all.  See
>> "Limitations" (#4 in particular) in
>> 
>> http://docs.ceph.com/docs/master/cephfs/quota/
>> 
>> Cheers,
>> --
>> Luís
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-26 Thread Marc Roos
 
My two cents: a default Luminous cluster with 4 nodes and 2 MDS daemons taking
21 seconds to respond?? Is that not a bit long for a 4-node, 2-MDS cluster?

After flushing caches and doing 
[@c03 sbin]# ceph mds fail c
failed mds gid 3464231


[@c04 5]# time ls -l
total 2
...

real 0m21.891s
user 0m0.002s
sys  0m0.001s

With this command

ceph tell mds.a injectargs '--mds_beacon_grace=5'

I am getting

Error EPERM: problem getting command descriptions from mds.a





-Original Message-
From: Patrick Donnelly [mailto:pdonn...@redhat.com] 
Sent: 20 February 2019 21:46
To: Fyodor Ustinov
Cc: ceph-users
Subject: Re: [ceph-users] faster switch to another mds

On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> From documentation:
>
> mds beacon grace
> Description:The interval without beacons before Ceph declares an 
MDS laggy (and possibly replace it).
> Type:   Float
> Default:15
>
> I do not understand, 15 - are is seconds or beacons?

seconds

> And an additional misunderstanding: if we gently turn off the MDS (or
> MON), why does it not inform everyone interested before it dies: "I am
> turned off, no need to wait, appoint a new active server"?

The MDS does inform the monitors if it has been shut down cleanly. If you pull
the plug or SIGKILL it, it does not. :)


--
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quotas on Subdirectories

2019-02-26 Thread Hendrik Peyerl
Thank you Luis, I’m looking forward to a solution.

> On 26. Feb 2019, at 13:10, Luis Henriques  wrote:
> 
> Hendrik Peyerl  writes:
> 
>> Thank you Ramana and Luis for your quick reply.
>> 
>> @ Ramana: I have a quota for 300G for this specific environment, I dont want 
>> to
>> split this into 100G quotas for all the subdirectories as i cannot yet forsee
>> how big they will be.
>> 
>> @ Luis: The Client has access to the Environment directory as you can
>> see from the Client Caps I sent aswell.
> 
> Hmm.. Ok, I misunderstood your issue.
> 
> I've done a quick test and the fuse client seems to be able to handle
> this scenario correctly, so I've created a bug in the tracker[1].  I'll
> investigate and see if this can be fixed.
> 
> [1] https://tracker.ceph.com/issues/38482
> 
> Cheers,
> -- 
> Luis
> 
> 
>> 
>> Thanks and best regards,
>> 
>> Hendrik
>> 
>>> On 26. Feb 2019, at 11:11, Luis Henriques  wrote:
>>> 
>>> On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote:
 On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl  
 wrote: 
> 
> Hello All,
> 
> I am having some troubles with Ceph Quotas not working on subdirectories. 
> I
> am running with the following directory tree:
> 
> - customer
> - project
>   - environment
> - application1
> - application2
> - applicationx
> 
> I set a quota on environment which works perfectly fine, the client sees 
> the
> quota and is not breaching it. The problem starts when I try to mount a
> subdirectory like application1, this directory does not have any quota at
> all.
> Is there a possibility to set a quota for environment so that the 
> application
> directories will not be able to go over that quota?
 
 Can you set quotas on the application directories as well?
 setfattr -n ceph.quota.max_bytes -v  
 /environment/application1 
>>> 
>>> Right, that would work of course.  The client needs to have access to
>>> the 'environment' directory inode in order to enforce quotas, otherwise
>>> it won't be aware of the existence of any quotas at all.  See
>>> "Limitations" (#4 in particular) in
>>> 
>>> http://docs.ceph.com/docs/master/cephfs/quota/
>>> 
>>> Cheers,
>>> --
>>> Luís
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-Site Cluster RGW Sync issues

2019-02-26 Thread Benjamin . Zieglmeier
Hello,

We have a Luminous 12.2.5 cluster configured as a two-zone multisite. The cluster
has been running for about 1 year and has only ~140G of data (~350k objects). We
recently added a third zone to the zonegroup to facilitate a migration out of
an existing site. Sync appears to be working, and running `radosgw-admin sync
status` and `radosgw-admin sync status --rgw-zone=<zone>` reflects the
same. The problem we are having is that once the data replication completes,
one of the rgws serving the new zone has its radosgw process consuming all the
CPU, and the rgw log is flooded with "ERROR: failed to read mdlog info with (2)
No such file or directory", at a rate of about 1000 log entries/sec.

This has been happening for days on end now, and we are concerned about what is
going on between these two zones. Logs are constantly filling up on the rgws
and we are out of ideas. Are they trying to catch up on metadata? After
extensive searching and racking our brains, we are unable to figure out what is
causing all these requests (and errors) between the two zones.

Thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redirect log to syslog and disable log to stderr

2019-02-26 Thread Alex Litvak

Dear Cephers,

In Mimic 13.2.2,

ceph tell mgr.* injectargs --log-to-stderr=false

returns an error (no valid command found ...). What is the correct way to
inject mgr configuration values?

The same command works on mon

ceph tell mon.* injectargs --log-to-stderr=false
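
For what it's worth, the centralized config store introduced in Mimic does
accept the option for the mgr, though I am not sure whether that is the
intended replacement for injectargs here:

ceph config set mgr log_to_stderr false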


Thank you in advance,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph bluestore performance on 4kn vs. 512e?

2019-02-26 Thread Martin Verges
Hello Oliver,

As 512e requires the drive to read a 4k block, change the 512 bytes and then
write the 4k block back to the disk, it should have a significant
performance impact. However, the costs are the same, so always choose 4Kn drives.
By the way, this might not affect you as long as you always write 4k at once, but
I'm unsure whether that is guaranteed in every use case or in a Ceph-specific
scenario, so to be safe choose 4Kn drives.
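
You can check what a drive actually reports with, for example (the device name
is a placeholder):

smartctl -i /dev/sdX | grep -i 'sector size'
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdX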

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Mo., 25. Feb. 2019 um 12:43 Uhr schrieb Oliver Schulz <
oliver.sch...@tu-dortmund.de>:

> Dear all,
>
> in real-world use, is there a significant performance
> benefit in using 4kn instead of 512e HDDs (using
> Ceph bluestore with block-db on NVMe-SSD)?
>
>
> Cheers and thanks for any advice,
>
> Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked ops after change from filestore on HDD to bluestore on SSD

2019-02-26 Thread Uwe Sauter

Hi,

TL;DR: In my Ceph clusters I replaced all OSDs from HDDs of several brands and models with Samsung 860 Pro SSDs and used 
the opportunity to switch from filestore to bluestore. Now I'm seeing blocked ops in Ceph and file system freezes inside 
VMs. Any suggestions?



I have two Proxmox clusters for virtualization which use Ceph on HDDs as backend storage for VMs. About half a year ago 
I had to increase the pool size and used the occasion to switch from filestore to bluestore. That was when trouble 
started. Both clusters showed blocked ops that caused freezes inside VMs which needed a reboot to function properly 
again. I wasn't able to identify the cause of the blocking ops but I blamed the low performance of the HDDs. It was also 
the time when patches for Spectre/Meltdown were released. Kernel 4.13.x didn't show the behavior while kernel 4.15.x 
did. After several weeks of debugging the workaround was to go back to filestore.


Today I replaced all HDDs with brand new Samsung 860 Pro SSDs and switched to bluestore again (on one cluster). And… the
blocked ops reappeared. I am out of ideas about the cause.


Any idea why bluestore is so much more demanding on the storage devices 
compared to filestore?

Before switching back to filestore do you have any suggestions for debugging? 
Anything special to check for in the network?

The clusters are both connected via 10GbE (MTU 9000) and are only lightly loaded (15 VMs on the first, 6 VMs on the 
second). Each host has 3 SSDs and 64GB memory.
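
On the network side, one quick thing to check is whether jumbo frames actually
pass end-to-end between all hosts, e.g. (the peer address is a placeholder;
8972 = 9000 minus 28 bytes of IP/ICMP headers):

ping -M do -s 8972 <peer-ip>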


"rados bench" gives decent results for 4M block size but 4K block size triggers blocked ops (and only finishes after I 
restart the OSD with the blocked ops). Results below.



Thanks,

Uwe




Results from "rados bench" runs with 4K block size when the cluster didn't 
block:

root@px-hotel-cluster:~# rados bench -p scbench 60 write -b 4K -t 16 
--no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up 
to 60 seconds or 0 objects
Object prefix: benchmark_data_px-hotel-cluster_3814550
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  16  2338  2322   9.06888   9.07031   0.0068972   0.0068597
2  16  4631  4615   9.01238   8.95703   0.0076618  0.00692027
3  16  6936  6920   9.00928   9.00391   0.0066511  0.00692966
4  16  9173  9157   8.94133   8.73828  0.00416256  0.00698071
5  16 11535 11519   8.99821   9.22656  0.00799875  0.00693842
6  16 13892 13876   9.03287   9.20703  0.00688782  0.00691459
7  15 16173 16158   9.01578   8.91406  0.00791589  0.00692736
8  16 18406 18390   8.97854   8.71875  0.00745151  0.00695723
9  16 20681 20665   8.96822   8.88672   0.0072881  0.00696475
   10  16 23037 23021   8.99163   9.20312  0.00728763   0.0069473
   11  16 24261 24245   8.60882   4.78125  0.00502342  0.00725673
   12  16 25420 25404   8.26863   4.52734  0.00443917  0.00750865
   13  16 27347 27331   8.21154   7.52734  0.00670819  0.00760455
   14  16 28750 28734   8.01642   5.48047  0.00617038  0.00779322
   15  16 30222 30206   7.8653  5.75  0.00700398  0.00794209
   16  16 32180 32164   7.8517   7.64844  0.00704785   0.0079573
   17  16 34527 34511   7.92907   9.16797  0.00582831  0.00788017
   18  15 36969 36954   8.01868   9.54297  0.00635168  0.00779228
   19  16 39059 39043   8.02609   8.16016  0.00622597  0.00778436
2019-02-26 21:55:41.623245 min lat: 0.00337595 max lat: 0.431158 avg lat: 
0.00779143
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20  16 41079 41063   8.01928   7.89062  0.00649895  0.00779143
   21  16 43076 43060   8.00878   7.80078  0.00726145  0.00780128
   22  16 45433 45417   8.06321   9.20703  0.00455727  0.00774944
   23  16 47763 47747   8.10832   9.10156  0.00582818  0.00770599
   24  16 50079 50063   8.14738   9.04688   0.0051125  0.00766894
   25  16 52477 52461   8.19614   9.36719  0.00537575  0.00762343
   26  16 54895 54879   8.24415   9.44531  0.00573134  0.00757909
   27  16 57276 57260   8.28325   9.30078  0.00576683  0.00754383
   28  16 59487 59471   8.29585   8.63672  0.00651535  0.00753232
   29  16 61948 61932   8.34125   9.61328  0.00499461  0.00749048
   30  16 64289 64273   8.36799   9.14453  0.00735917  0.00746708
   31  16 66645 66629   8.3949   9.20312  0.00644432  0.00744233
   32  16 68926 68910   8.41098   8.91016  0.00545702   0.0074289
   33  16 71257 71241 8.432   9.10547  0.00505016  0.00741037
   34  16 73668 73652   8.460

Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-26 Thread solarflow99
I knew it. FW updates are very important for SSDs.

On Sat, Feb 23, 2019 at 8:35 PM Michel Raabe  wrote:

> On Monday, February 18, 2019 16:44 CET, David Turner <
> drakonst...@gmail.com> wrote:
> > Has anyone else come across this issue before?  Our current theory is
> that
> > Bluestore is accessing the disk in a way that is triggering a bug in the
> > older firmware version that isn't triggered by more traditional
> > filesystems.  We have a scheduled call with Intel to discuss this, but
> > their preliminary searches into the bugfixes and known problems between
> > firmware versions didn't indicate the bug that we triggered.  It would be
> > good to have some more information about what those differences for disk
> > accessing might be to hopefully get a better answer from them as to what
> > the problem is.
> >
> >
> > [1]
> >
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
>
>  Yes and no. We got the same issue with the P4500 4TB: 3 disks in one day.
> In the end it was a firmware bug.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-26 Thread solarflow99
I saw Intel had a demo of a Luminous cluster running on top-of-the-line
hardware; they used 2 OSD partitions per device for the best performance. I was
interested that they would split them like that, and asked the demo person
how they came to that number. I never got a really good answer except that
it would provide better performance. So I guess this must be why.



On Mon, Feb 25, 2019 at 8:30 PM  wrote:

> I create 2-4 RBD images sized 10GB or more with --thick-provision, then
> run
>
> fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
> -rw=randwrite -pool=rpool -runtime=60 -rbdname=testimg
>
> For each of them at the same time.
>
> > How do you test what total 4Kb random write iops (RBD) you have?
> >
> > -Original Message-
> > From: Vitaliy Filippov [mailto:vita...@yourcmc.ru]
> > Sent: 24 February 2019 17:39
> > To: David Turner
> > Cc: ceph-users; 韦皓诚
> > Subject: *SPAM* Re: [ceph-users] Configuration about using nvme
> > SSD
> >
> > I've tried 4x OSD on fast SAS SSDs in a test setup with only 2 such
> > drives in cluster - it increased CPU consumption a lot, but total 4Kb
> > random write iops (RBD) only went from ~11000 to ~22000. So it was 2x
> > increase, but at a huge cost.
> >
> >> One thing that's worked for me to get more out of nvmes with Ceph is
> >> to create multiple partitions on the nvme with an osd on each
> > partition.
> >> That
> >> way you get more osd processes and CPU per nvme device. I've heard of
> >> people using up to 4 partitions like this.
> >
> > --
> > With best regards,
> >Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-26 Thread Jeff Smith
We had several PostgreSQL servers running these disks from Dell. Numerous
failures, including one server that had 3 die at once. Dell claims it is a
firmware issue and instructed us to upgrade to QDV1DP15 from QDV1DP12 (I am
not sure how these line up with the Intel firmwares). We lost several more
during the upgrade process. We are using ZFS with these drives, so I can
confirm it is not a Ceph/Bluestore-only issue.
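
For anyone checking their own drives, the firmware revision is visible with
e.g. (the device name is a placeholder):

nvme list
smartctl -i /dev/nvme0n1 | grep -i firmware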

On Mon, Feb 18, 2019 at 8:44 AM David Turner  wrote:

> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
> 12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
> and 30 NVMe's in total.  They were all built at the same time and were
> running firmware version QDV10130.  On this firmware version we early on
> had 2 disks failures, a few months later we had 1 more, and then a month
> after that (just a few weeks ago) we had 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS.  This
> holds true beyond server reboots as well as placing the failed disks into a
> new server.  With a firmware upgrade tool we got an error that pretty much
> said there's no way to get data back and to RMA the disk.  We upgraded all
> of our remaining disks' firmware to QDV101D1 and haven't had any problems
> since then.  Most of our failures happened while rebalancing the cluster
> after replacing dead disks and we tested rigorously around that use case
> after upgrading the firmware.  This firmware version seems to have resolved
> whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the
> QDV10130 firmware as well as firmwares between this one and the one we
> upgraded to.  Bluestore on Ceph is the only use case we've had so far with
> this sort of failure.
>
> Has anyone else come across this issue before?  Our current theory is that
> Bluestore is accessing the disk in a way that is triggering a bug in the
> older firmware version that isn't triggered by more traditional
> filesystems.  We have a scheduled call with Intel to discuss this, but
> their preliminary searches into the bugfixes and known problems between
> firmware versions didn't indicate the bug that we triggered.  It would be
> good to have some more information about what those differences for disk
> accessing might be to hopefully get a better answer from them as to what
> the problem is.
>
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about rbd-mirror and clones

2019-02-26 Thread Jason Dillaman
On Tue, Feb 26, 2019 at 7:49 PM Anthony D'Atri  wrote:
>
> Hello again.
>
> I have a couple of questions about rbd-mirror that I'm hoping you can help me 
> with.
>
>
> 1) http://docs.ceph.com/docs/mimic/rbd/rbd-snapshot/ indicates that 
> protecting is required for cloning.  We somehow had the notion that this had 
> been / will be done away with, but don't remember where we saw that.  
> Thoughts?

By default, if the cluster is configured to require Mimic or later
clients, you no longer need to protect/unprotect snapshots prior to
cloning [1]. The documentation still talks about
protecting/unprotecting snapshots since the new clone v2 format isn't
currently enabled by default in order to preserve backwards
compatibility to older librbd/krbd clients. Once we no longer support
upgrading from pre-Mimic releases, we can enable clone v2 by default
and start deprecating snapshot protect/unprotect features.
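
A minimal sketch of what that looks like (pool and image names are
placeholders):

ceph osd set-require-min-compat-client mimic
rbd snap create rbd/parent@snap1
rbd clone rbd/parent@snap1 rbd/child      # no 'rbd snap protect' needed with clone v2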

> 2) We're currently running 12.2.2 on our cluster nodes, with rbd-mirror 
> running in a container built against 12.2.8.  Should we expect images with 
> clones / parents to successfully migrate with rbd-mirror? I've had a few rude 
> awakenings here where I've flattened to remove the dependency, but in the 
> general case would rather not have to sacrifice the underlying capacity.

Yes, thinly provisioned cloned images have always been supported with
RBD mirroring (Jewel release). You do, however, need to ensure that
the parent image has mirroring enabled.
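
For example, assuming the parent image lives in a pool named "rbd" (a
placeholder) with per-image mirroring:

rbd mirror pool enable rbd image
rbd mirror image enable rbd/parent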

>
>
>
> Context:  We aren't using rbd-mirror for DR, we're using it to move volumes 
> between clusters for capacity management.
>
> Hope to see you at Cephalocon.
>
>
>
>
> Anthony D'Atri
> Storage Engineer
> 425-343-5133
> ada...@digitalocean.com
> 
> We're Hiring! | @digitalocean | linkedin
>

[1] https://ceph.com/community/new-mimic-simplified-rbd-image-cloning/

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] luminous 12.2.11 on debian 9 requires nscd?

2019-02-26 Thread Chad W Seys
Hi all,
   I cannot get my luminous 12.2.11 mds servers to start on Debian 9(.8) 
unless nscd is also installed.

   Trying to start from command line:
#  /usr/bin/ceph-mds -f --cluster ceph --id mds02.hep.wisc.edu --setuser ceph --setgroup ceph
unable to look up group 'ceph': (34) Numerical result out of range

   I can look up the ceph user fine with 'id':
# id ceph
uid=11(ceph) gid=11(ceph) groups=11(ceph)
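
Another way to exercise the NSS lookup from the shell, in case it helps narrow
this down:

# getent passwd ceph
# getent group ceph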


If I strace, I notice that an nscd socket path makes an appearance:
[...]
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
lseek(3, 0, SEEK_CUR)   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=285846, ...}) = 0
mmap(NULL, 285846, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970ed2000
lseek(3, 285846, SEEK_SET)  = 285846
munmap(0x7f5970ed2000, 285846)  = 0
close(3)= 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 
-1 ENOENT (No such file or directory)
close(3)= 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 
-1 ENOENT (No such file or directory)
close(3)= 0
open("/etc/group", O_RDONLY|O_CLOEXEC)  = 3
lseek(3, 0, SEEK_CUR)   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=122355, ...}) = 0
mmap(NULL, 122355, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970efa000
lseek(3, 122355, SEEK_SET)  = 122355
lseek(3, 7495, SEEK_SET)= 7495
munmap(0x7f5970efa000, 122355)  = 0
close(3)= 0
write(2, "unable to look up group '", 25unable to look up group ') = 25
write(2, "ceph", 4ceph) = 4
write(2, "'", 1')= 1
write(2, ": ", 2: )   = 2
write(2, "(34) Numerical result out of ran"..., 34(34) Numerical result 
out of range) = 34
write(2, "\n", 1

So I installed nscd and now the mds starts!

Shouldn't Ceph be agnostic about how the ceph group is looked up? Or do I
have some kind of config problem?

My nsswitch.conf file is below. I've tried replacing 'compat' with
'files', but there is no change.

# cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd: compat
group:  compat
shadow: compat
gshadow:files

hosts:  files dns
networks:   files

protocols:  db files
services:   db files
ethers: db files
rpc:db files

netgroup:   nis


Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic and cephfs

2019-02-26 Thread Sergey Malinin
I've been using a fresh 13.2.2 install in production for 4 months now without any
issues.


February 25, 2019 10:17 PM, "Andras Pataki"  
wrote:

> Hi ceph users,
> 
> As I understand, cephfs in Mimic had significant issues up to and 
> including version 13.2.2.  With some critical patches in Mimic 13.2.4, 
> is cephfs now production quality in Mimic?  Are there folks out there 
> using it in a production setting?  If so, could you share your 
> experience with it (as compared to Luminous)?
> 
> Thanks,
> 
> Andras
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com