[ceph-users] CephFS Quotas on Subdirectories
Hello All,

I am having some trouble with CephFS quotas not working on subdirectories. I am running with the following directory tree:

- customer
  - project
    - environment
      - application1
      - application2
      - applicationx

I set a quota on environment, which works perfectly fine: the client sees the quota and does not breach it. The problem starts when I try to mount a subdirectory like application1; that directory does not have any quota at all. Is there a possibility to set a quota on environment so that the application directories will not be able to go over that quota?

Client caps:

caps: [mds] allow rw path=/customer/project/environment
caps: [mon] allow r
caps: [osd] allow rw tag cephfs data=cephfs

My environment: Ceph 13.2.4 on CentOS 7.6 with kernel 4.20.3-1 for both servers and clients.

Any help would be greatly appreciated.

Best Regards,
Hendrik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Right way to delete OSD from cluster?
Hi! Thank you so much! I do not understand why, but your variant really causes only one rebalance compared to the "osd out". - Original Message - From: "Scottix" To: "Fyodor Ustinov" Cc: "ceph-users" Sent: Wednesday, 30 January, 2019 20:31:32 Subject: Re: [ceph-users] Right way to delete OSD from cluster? I generally have gone the crush reweight 0 route This way the drive can participate in the rebalance, and the rebalance only happens once. Then you can take it out and purge. If I am not mistaken this is the safest. ceph osd crush reweight 0 On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov wrote: > > Hi! > > But unless after "ceph osd crush remove" I will not got the undersized > objects? That is, this is not the same thing as simply turning off the OSD > and waiting for the cluster to be restored? > > - Original Message - > From: "Wido den Hollander" > To: "Fyodor Ustinov" , "ceph-users" > Sent: Wednesday, 30 January, 2019 15:05:35 > Subject: Re: [ceph-users] Right way to delete OSD from cluster? > > On 1/30/19 2:00 PM, Fyodor Ustinov wrote: > > Hi! > > > > I thought I should first do "ceph osd out", wait for the end relocation of > > the misplaced objects and after that do "ceph osd purge". > > But after "purge" the cluster starts relocation again. > > > > Maybe I'm doing something wrong? Then what is the correct way to delete the > > OSD from the cluster? > > > > You are not doing anything wrong, this is the expected behavior. There > are two CRUSH changes: > > - Marking it out > - Purging it > > You could do: > > $ ceph osd crush remove osd.X > > Wait for all good > > $ ceph osd purge X > > The last step should then not initiate any data movement. > > Wido > > > WBR, > > Fyodor. > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- T: @Thaumion IG: Thaumion scot...@gmail.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
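For reference, the reweight-to-zero route Scottix describes works out to roughly the following sequence. This is only a sketch; osd.5 is a placeholder id, and cluster health should be verified between every step.

# remove the OSD's CRUSH weight so its data drains off; this is the only rebalance
ceph osd crush reweight osd.5 0

# wait until all PGs are active+clean again
ceph -s

# then mark it out, stop the daemon and purge it; no further data movement expected
ceph osd out 5
systemctl stop ceph-osd@5
ceph osd purge 5 --yes-i-really-mean-it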
Re: [ceph-users] ceph migration
Hi,

> Well, I've just reacted to all the text at the beginning of
> http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
> including the title "the messy way". If the cluster is clean I see no
> reason for doing brain surgery on monmaps just to "save" a few minutes
> of redoing correctly from scratch.

With that I would agree. Careful planning and an installation following the docs should be first priority. But I would also encourage users to experiment with Ceph before going into production. Dealing with failures and outages on a production cluster causes much more headache than on a test cluster. ;-)

If the cluster is empty anyway, I would also rather reinstall it, it doesn't take that much time. I just wanted to point out that there is a way that worked for me, although that was only a test cluster.

Regards,
Eugen

Quoting Janne Johansson:

On Mon, 25 Feb 2019 at 13:40, Eugen Block wrote:

> I just moved a (virtual lab) cluster to a different network, it worked
> like a charm. In an offline method you need to:
> - set osd noout, ensure there are no OSDs up
> - change the MONs' IPs, see the bottom of [1] "CHANGING A MONITOR'S IP
>   ADDRESS"; MONs are the only ones really sticky with the IP
> - ensure ceph.conf has the new MON IPs and network IPs
> - start MONs with the new monmap, then start OSDs
>
> > No, certain ips will be visible in the databases, and those will not change.
>
> I'm not sure where old IPs will be still visible, could you clarify
> that, please?

Well, I've just reacted to all the text at the beginning of http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way including the title "the messy way". If the cluster is clean I see no reason for doing brain surgery on monmaps just to "save" a few minutes of redoing correctly from scratch.

What if you miss some part, some command gives you an error you really aren't comfortable with, or something doesn't really feel right after doing it? Then the whole lifetime of that cluster will be followed by a small nagging feeling that it might have been that time you followed a guide that tries to talk you out of doing it that way, for a cluster with no data. I think that is the wrong way to learn how to run clusters.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
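For anyone who does want to try the monmap route on a lab cluster, the "messy way" from the cited docs boils down to roughly the following sketch. The mon id 'a' and the new address are placeholders; all monitors must be stopped first, and ceph.conf / mon_host updated afterwards.

# extract the current monmap from a stopped monitor
ceph-mon -i a --extract-monmap /tmp/monmap

# drop the old entry and re-add the monitor with its new address
monmaptool --rm a /tmp/monmap
monmaptool --add a 192.168.10.11:6789 /tmp/monmap
monmaptool --print /tmp/monmap

# inject the modified map into every monitor, then start them again
ceph-mon -i a --inject-monmap /tmp/monmap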
Re: [ceph-users] CephFS Quotas on Subdirectories
On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl wrote: > > Hello All, > > I am having some troubles with Ceph Quotas not working on subdirectories. I > am running with the following directory tree: > > - customer > - project > - environment > - application1 > - application2 > - applicationx > > I set a quota on environment which works perfectly fine, the client sees the > quota and is not breaching it. The problem starts when I try to mount a > subdirectory like application1, this directory does not have any quota at > all. > Is there a possibility to set a quota for environment so that the application > directories will not be able to go over that quota? Can you set quotas on the application directories as well? setfattr -n ceph.quota.max_bytes -v /environment/application1 > > Client Caps: > > caps: [mds] allow rw path=/customer/project/environment > caps: [mon] allow r > caps: [osd] allow rw tag cephfs data=cephfs > > > My Environment: > > Ceph 13.2.4 on CentOS 7.6 with Kernel 4.20.3-1 for both Servers and Clients > > > Any help would be greatly appreciated. > > Best Regards, > > Hendrik > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
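For completeness, ceph.quota.max_bytes takes a value in bytes; the 300G quota mentioned later in the thread would look something like this (the /mnt/cephfs mount point is an assumption, not from the thread):

# 300 GiB quota on the environment directory (value in bytes)
setfattr -n ceph.quota.max_bytes -v 322122547200 /mnt/cephfs/customer/project/environment

# confirm it took effect
getfattr -n ceph.quota.max_bytes /mnt/cephfs/customer/project/environment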
Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time
On Mon, Feb 25, 2019 at 9:26 PM mart.v wrote:

> - As far as I understand the reported 'implicated osds' are only the
> primary ones. In the log of the osds you should find also the relevant pg
> number, and with this information you can get all the involved OSDs. This
> might be useful e.g. to see if a specific OSD node is always involved. This
> was my case (and the problem was with the patch cable connecting the node)
>
> I can see right from the REQUEST_SLOW error log lines implicated OSDs and
> therefore I can tell which nodes are involved. It is indeed on all nodes in
> a cluster, no exception. So it cannot be linked to one specific node.

I am afraid I was not clear enough. Suppose that ceph health detail reports a slow request involving osd.14. In the osd.14 log I see this line:

2019-02-24 16:58:39.475740 7fe25a84d700 0 log_channel(cluster) log [WRN] : slow request 30.328572 seconds old, received at 2019-02-24 16:58:09.147037: osd_op(client.148580771.0:476351313 8.1d6 8:6ba6a916:::rbd_data.ba32e7238e1f29.04b3:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3776512~4096] snapc 0=[] ondisk+write+known_if_redirected e1242718) currently op_applied

Here the pg is 8.1d6:

# ceph pg map 8.1d6
osdmap e1247126 pg 8.1d6 (8.1d6) -> up [14,38,24] acting [14,38,24]

So the problem is not necessarily in osd.14. It could also be in osd.38 or osd.24, or in the relevant hosts.

> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
> some of these slow requests (to see which events take much time)
>
> 2019-02-25 17:40:49.550303 initiated
> 2019-02-25 17:40:49.550338 queued_for_pg
> 2019-02-25 17:40:49.550924 reached_pg
> 2019-02-25 17:40:49.550950 started
> 2019-02-25 17:40:49.550989 waiting for subops from 21,35
> 2019-02-25 17:40:49.552316 op_commit
> 2019-02-25 17:40:49.552320 op_applied
> 2019-02-25 17:40:49.553216 sub_op_commit_rec from 21
> 2019-02-25 17:41:18.416662 sub_op_commit_rec from 35
> 2019-02-25 17:41:18.416708 commit_sent
> 2019-02-25 17:41:18.416726 done
>
> I'm not sure how to read this output - the time is start or finish? Does
> it mean that it is waiting for OSD 21 or 35? I tried to examine few
> different OSDs for dump_historic_ops, they all seems to wait on other OSDs.
> But there is no similarity (OSD numbers are different).

As far as I understand, in this case most of the time was spent waiting for an answer from osd.35.

PS: You might also want to have a look at the thread "Debugging 'slow requests'" in this mailing list, where Brad Hubbard (thanks again!) helped me debug a 'slow request' problem.

Cheers,
Massimo

> Best,
>
> Martin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
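Put together as commands, the procedure above looks like the following; the osd and pg ids are the ones from Massimo's example, and the log path assumes a default installation.

# find the pg referenced by the slow request in the implicated OSD's log
grep 'slow request' /var/log/ceph/ceph-osd.14.log

# map that pg to its full acting set; any of these OSDs (or their hosts) may be the culprit
ceph pg map 8.1d6

# on each involved OSD host, dump recent slow ops and their per-event timestamps
ceph daemon osd.14 dump_historic_ops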
Re: [ceph-users] Files in CephFS data pool
On 15/02/2019 22:46, Ragan, Tj (Dr.) wrote: Is there anyway to find out which files are stored in a CephFS data pool? I know you can reference the extended attributes, but those are only relevant for files created after ceph.dir.layout.pool or ceph.file.layout.pool attributes are set - I need to know about all the files in a pool. As far as I can tell you *can* read the ceph.file.layout.pool xattr on any files in CephFS, even those that haven't had it explicitly set. -- Hector Martin (hec...@marcansoft.com) Public Key: https://mrcn.st/pub ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
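A rough sketch of that check, including one way to sweep a whole tree (the /mnt/cephfs mount point is an assumption):

# pool backing a single file
getfattr -n ceph.file.layout.pool /mnt/cephfs/path/to/file

# print the pool for every file under a directory
find /mnt/cephfs -type f -exec getfattr -n ceph.file.layout.pool {} + 2>/dev/null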
Re: [ceph-users] CephFS Quotas on Subdirectories
On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote: > On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl wrote: > > > > Hello All, > > > > I am having some troubles with Ceph Quotas not working on subdirectories. I > > am running with the following directory tree: > > > > - customer > > - project > > - environment > > - application1 > > - application2 > > - applicationx > > > > I set a quota on environment which works perfectly fine, the client sees the > > quota and is not breaching it. The problem starts when I try to mount a > > subdirectory like application1, this directory does not have any quota at > > all. > > Is there a possibility to set a quota for environment so that the > > application > > directories will not be able to go over that quota? > > Can you set quotas on the application directories as well? > setfattr -n ceph.quota.max_bytes -v > /environment/application1 Right, that would work of course. The client needs to have access to the 'environment' directory inode in order to enforce quotas, otherwise it won't be aware of the existence of any quotas at all. See "Limitations" (#4 in particular) in http://docs.ceph.com/docs/master/cephfs/quota/ Cheers, -- Luís ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
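In practice that limitation means the client should mount at (or above) the directory carrying the quota. One possible arrangement using the paths from this thread, as a sketch only; the monitor address, client name and mount points are assumptions:

# mount the directory that carries the quota with the kernel client...
mount -t ceph mon1:6789:/customer/project/environment /mnt/environment \
    -o name=envclient,secretfile=/etc/ceph/envclient.secret

# ...and expose a single application directory to the workload via a bind mount
mount --bind /mnt/environment/application1 /srv/application1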
Re: [ceph-users] CephFS Quotas on Subdirectories
Thank you Ramana and Luis for your quick reply. @ Ramana: I have a quota for 300G for this specific environment, I dont want to split this into 100G quotas for all the subdirectories as i cannot yet forsee how big they will be. @ Luis: The Client has access to the Environment directory as you can see from the Client Caps I sent aswell. Thanks and best regards, Hendrik > On 26. Feb 2019, at 11:11, Luis Henriques wrote: > > On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote: >> On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl >> wrote: >>> >>> Hello All, >>> >>> I am having some troubles with Ceph Quotas not working on subdirectories. I >>> am running with the following directory tree: >>> >>> - customer >>> - project >>>- environment >>> - application1 >>> - application2 >>> - applicationx >>> >>> I set a quota on environment which works perfectly fine, the client sees the >>> quota and is not breaching it. The problem starts when I try to mount a >>> subdirectory like application1, this directory does not have any quota at >>> all. >>> Is there a possibility to set a quota for environment so that the >>> application >>> directories will not be able to go over that quota? >> >> Can you set quotas on the application directories as well? >> setfattr -n ceph.quota.max_bytes -v >> /environment/application1 > > Right, that would work of course. The client needs to have access to > the 'environment' directory inode in order to enforce quotas, otherwise > it won't be aware of the existence of any quotas at all. See > "Limitations" (#4 in particular) in > > http://docs.ceph.com/docs/master/cephfs/quota/ > > Cheers, > -- > Luís ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw-admin reshard stale-instances rm experience
On 2/21/19 9:19 PM, Paul Emmerich wrote: > On Thu, Feb 21, 2019 at 4:05 PM Wido den Hollander wrote: >> This isn't available in 13.2.4, but should be in 13.2.5, so on Mimic you >> will need to wait. But this might bite you at some point. > > Unfortunately it hasn't been backported to Mimic: > http://tracker.ceph.com/issues/37447 > I see. We really need this in Mimic as well. I have another cluster, which is running Mimic, but it's a suspect as well. 547 buckets, but 290k objects in the index pool. That ratio is not correct. > This is the Luminous backport: > https://github.com/ceph/ceph/pull/25326/files which looks a little bit > messy because it fixes 3 related issues in one backport. > > CC'ing devel: best way to get this in Mimic? > I'd love to know as well. Wido > Paul > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS Quotas on Subdirectories
Hendrik Peyerl writes: > Thank you Ramana and Luis for your quick reply. > > @ Ramana: I have a quota for 300G for this specific environment, I dont want > to > split this into 100G quotas for all the subdirectories as i cannot yet forsee > how big they will be. > > @ Luis: The Client has access to the Environment directory as you can > see from the Client Caps I sent aswell. Hmm.. Ok, I misunderstood your issue. I've done a quick test and the fuse client seems to be able to handle this scenario correctly, so I've created a bug in the tracker[1]. I'll investigate and see if this can be fixed. [1] https://tracker.ceph.com/issues/38482 Cheers, -- Luis > > Thanks and best regards, > > Hendrik > >> On 26. Feb 2019, at 11:11, Luis Henriques wrote: >> >> On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote: >>> On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl >>> wrote: Hello All, I am having some troubles with Ceph Quotas not working on subdirectories. I am running with the following directory tree: - customer - project - environment - application1 - application2 - applicationx I set a quota on environment which works perfectly fine, the client sees the quota and is not breaching it. The problem starts when I try to mount a subdirectory like application1, this directory does not have any quota at all. Is there a possibility to set a quota for environment so that the application directories will not be able to go over that quota? >>> >>> Can you set quotas on the application directories as well? >>> setfattr -n ceph.quota.max_bytes -v >>> /environment/application1 >> >> Right, that would work of course. The client needs to have access to >> the 'environment' directory inode in order to enforce quotas, otherwise >> it won't be aware of the existence of any quotas at all. See >> "Limitations" (#4 in particular) in >> >> http://docs.ceph.com/docs/master/cephfs/quota/ >> >> Cheers, >> -- >> Luís > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] faster switch to another mds
My two cents: on a default Luminous cluster with 4 nodes and 2 MDS, failover is taking 21 seconds. Is that not a bit long for a 4-node, 2x MDS cluster? After flushing caches and doing:

[@c03 sbin]# ceph mds fail c
failed mds gid 3464231

[@c04 5]# time ls -l
total 2
...
real 0m21.891s
user 0m0.002s
sys 0m0.001s

And with this:

ceph tell mds.a injectargs '--mds_beacon_grace=5'

I am getting:

Error EPERM: problem getting command descriptions from mds.a

-Original Message-
From: Patrick Donnelly [mailto:pdonn...@redhat.com]
Sent: 20 February 2019 21:46
To: Fyodor Ustinov
Cc: ceph-users
Subject: Re: [ceph-users] faster switch to another mds

On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov wrote:
>
> Hi!
>
> From documentation:
>
> mds beacon grace
> Description: The interval without beacons before Ceph declares an MDS laggy (and possibly replace it).
> Type: Float
> Default: 15
>
> I do not understand: is 15 seconds or beacons?

seconds

> And an additional misunderstanding - if we gently turn off the MDS (or MON), why does it not inform everyone interested before death - "I am turned off, no need to wait, appoint a new active server"?

The MDS does inform the monitors if it has been shutdown. If you pull the plug or SIGKILL, it does not. :)

--
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
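As an aside, the EPERM above often indicates the keyring in use is not allowed to send tell commands to the MDS at all, though that is only a guess from the error text. Independent of that, mds_beacon_grace is evaluated by the monitors, so a sketch for lowering it (whether 5 seconds is a sane value for a given cluster is a separate question) could be:

# inject on the monitors as well, since they are the ones declaring an MDS laggy
ceph tell mon.* injectargs '--mds_beacon_grace=5'

# and make it persistent in ceph.conf on the mon/mds hosts (restart required)
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
mds beacon grace = 5
EOF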
Re: [ceph-users] CephFS Quotas on Subdirectories
Thank you Luis, I’m looking forward to a solution. > On 26. Feb 2019, at 13:10, Luis Henriques wrote: > > Hendrik Peyerl writes: > >> Thank you Ramana and Luis for your quick reply. >> >> @ Ramana: I have a quota for 300G for this specific environment, I dont want >> to >> split this into 100G quotas for all the subdirectories as i cannot yet forsee >> how big they will be. >> >> @ Luis: The Client has access to the Environment directory as you can >> see from the Client Caps I sent aswell. > > Hmm.. Ok, I misunderstood your issue. > > I've done a quick test and the fuse client seems to be able to handle > this scenario correctly, so I've created a bug in the tracker[1]. I'll > investigate and see if this can be fixed. > > [1] https://tracker.ceph.com/issues/38482 > > Cheers, > -- > Luis > > >> >> Thanks and best regards, >> >> Hendrik >> >>> On 26. Feb 2019, at 11:11, Luis Henriques wrote: >>> >>> On Tue, Feb 26, 2019 at 03:47:31AM -0500, Ramana Raja wrote: On Tue, Feb 26, 2019 at 1:38 PM, Hendrik Peyerl wrote: > > Hello All, > > I am having some troubles with Ceph Quotas not working on subdirectories. > I > am running with the following directory tree: > > - customer > - project > - environment > - application1 > - application2 > - applicationx > > I set a quota on environment which works perfectly fine, the client sees > the > quota and is not breaching it. The problem starts when I try to mount a > subdirectory like application1, this directory does not have any quota at > all. > Is there a possibility to set a quota for environment so that the > application > directories will not be able to go over that quota? Can you set quotas on the application directories as well? setfattr -n ceph.quota.max_bytes -v /environment/application1 >>> >>> Right, that would work of course. The client needs to have access to >>> the 'environment' directory inode in order to enforce quotas, otherwise >>> it won't be aware of the existence of any quotas at all. See >>> "Limitations" (#4 in particular) in >>> >>> http://docs.ceph.com/docs/master/cephfs/quota/ >>> >>> Cheers, >>> -- >>> Luís >> >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multi-Site Cluster RGW Sync issues
Hello,

We have a two-zone multisite configured Luminous 12.2.5 cluster. The cluster has been running for about 1 year, and has only ~140G of data (~350k objects). We recently added a third zone to the zonegroup to facilitate a migration out of an existing site. Sync appears to be working, and running `radosgw-admin sync status` and `radosgw-admin sync status --rgw-zone=` reflects the same.

The problem we are having is that once the data replication completes, one of the rgws serving the new zone has the radosgw process consuming all the CPU, and the rgw log is flooded with “ERROR: failed to read mdlog info with (2) No such file or directory”, to the amount of 1000 log entries/sec. This has been happening for days on end now, and we are concerned about what is going on between these two zones. Logs are constantly filling up on the rgws and we are out of ideas. Are they trying to catch up on metadata?

After extensive searching and racking our brains, we are unable to figure out what is causing all these requests (and errors) between the two zones.

Thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
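Some read-only commands that can help narrow down this kind of metadata-sync loop; the zone name is a placeholder, and none of these change any state:

# overall and metadata-specific sync state, seen from the new zone
radosgw-admin sync status --rgw-zone=newzone
radosgw-admin metadata sync status --rgw-zone=newzone

# any errors the sync machinery has recorded
radosgw-admin sync error list

# the current period, whose id the mdlog shards are looked up against
radosgw-admin period get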
Re: [ceph-users] redirect log to syslog and disable log to stderr
Dear Cephers,

In Mimic 13.2.2,

ceph tell mgr.* injectargs --log-to-stderr=false

returns an error (no valid command found ...). What is the correct way to inject mgr configuration values? The same command works on the mons:

ceph tell mon.* injectargs --log-to-stderr=false

Thank you in advance,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
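A possible route on Mimic is the centralized config database or the mgr's admin socket rather than injectargs; a sketch, with 'a' as a placeholder mgr id:

# persistently, via the config database (Mimic and later)
ceph config set mgr log_to_stderr false
ceph config set mgr log_to_syslog true

# or at runtime on a single daemon, via its admin socket on the mgr host
ceph daemon mgr.a config set log_to_stderr false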
Re: [ceph-users] Ceph bluestore performance on 4kn vs. 512e?
Hello Oliver,

since a 512e drive has to read a 4k physical block, modify the 512 bytes within it and then write the 4k block back to the disk, it should have a significant performance impact. However, the costs are the same, so always choose 4Kn drives.

By the way, this might not affect you as long as you always write 4k at once, but I'm unsure whether that is guaranteed in any use case or in a Ceph-specific scenario, therefore be safe and choose 4Kn drives.

--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Mon, 25 Feb 2019 at 12:43, Oliver Schulz <oliver.sch...@tu-dortmund.de> wrote:

> Dear all,
>
> in real-world use, is there a significant performance
> benefit in using 4kn instead of 512e HDDs (using
> Ceph bluestore with block-db on NVMe-SSD)?
>
> Cheers and thanks for any advice,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
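To see what a drive actually reports before ordering more, something like the following works (the device name is a placeholder). A 512e drive typically reports 512/4096 logical/physical, a 4Kn drive 4096/4096:

# logical vs. physical sector size of all block devices
lsblk -o NAME,LOG-SEC,PHY-SEC

# the same for a single drive
blockdev --getss --getpbsz /dev/sda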
[ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD
Hi, TL;DR: In my Ceph clusters I replaced all OSDs from HDDs of several brands and models with Samsung 860 Pro SSDs and used the opportunity to switch from filestore to bluestore. Now I'm seeing blocked ops in Ceph and file system freezes inside VMs. Any suggestions? I have two Proxmox clusters for virtualization which use Ceph on HDDs as backend storage for VMs. About half a year ago I had to increase the pool size and used the occasion to switch from filestore to bluestore. That was when trouble started. Both clusters showed blocked ops that caused freezes inside VMs which needed a reboot to function properly again. I wasn't able to identify the cause of the blocking ops but I blamed the low performance of the HDDs. It was also the time when patches for Spectre/Meltdown were released. Kernel 4.13.x didn't show the behavior while kernel 4.15.x did. After several weeks of debugging the workaround was to go back to filestore. Today I replace all HDDs with brand new Samsung 860 Pro SSDs and switched to bluestore again (on one cluster). And… the blocked ops reappeared. I am out of ideas about the cause. Any idea why bluestore is so much more demanding on the storage devices compared to filestore? Before switching back to filestore do you have any suggestions for debugging? Anything special to check for in the network? The clusters are both connected via 10GbE (MTU 9000) and are only lightly loaded (15 VMs on the first, 6 VMs on the second). Each host has 3 SSDs and 64GB memory. "rados bench" gives decent results for 4M block size but 4K block size triggers blocked ops (and only finishes after I restart the OSD with the blocked ops). Results below. Thanks, Uwe Results from "rados bench" runs with 4K block size when the cluster didn't block: root@px-hotel-cluster:~# rados bench -p scbench 60 write -b 4K -t 16 --no-cleanup hints = 1 Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 60 seconds or 0 objects Object prefix: benchmark_data_px-hotel-cluster_3814550 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 16 2338 2322 9.06888 9.07031 0.0068972 0.0068597 2 16 4631 4615 9.01238 8.95703 0.0076618 0.00692027 3 16 6936 6920 9.00928 9.00391 0.0066511 0.00692966 4 16 9173 9157 8.94133 8.73828 0.00416256 0.00698071 5 16 11535 11519 8.99821 9.22656 0.00799875 0.00693842 6 16 13892 13876 9.03287 9.20703 0.00688782 0.00691459 7 15 16173 16158 9.01578 8.91406 0.00791589 0.00692736 8 16 18406 18390 8.97854 8.71875 0.00745151 0.00695723 9 16 20681 20665 8.96822 8.88672 0.0072881 0.00696475 10 16 23037 23021 8.99163 9.20312 0.00728763 0.0069473 11 16 24261 24245 8.60882 4.78125 0.00502342 0.00725673 12 16 25420 25404 8.26863 4.52734 0.00443917 0.00750865 13 16 27347 27331 8.21154 7.52734 0.00670819 0.00760455 14 16 28750 28734 8.01642 5.48047 0.00617038 0.00779322 15 16 30222 302067.8653 5.75 0.00700398 0.00794209 16 16 32180 321647.8517 7.64844 0.00704785 0.0079573 17 16 34527 34511 7.92907 9.16797 0.00582831 0.00788017 18 15 36969 36954 8.01868 9.54297 0.00635168 0.00779228 19 16 39059 39043 8.02609 8.16016 0.00622597 0.00778436 2019-02-26 21:55:41.623245 min lat: 0.00337595 max lat: 0.431158 avg lat: 0.00779143 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 20 16 41079 41063 8.01928 7.89062 0.00649895 0.00779143 21 16 43076 43060 8.00878 7.80078 0.00726145 0.00780128 22 16 45433 45417 8.06321 9.20703 0.00455727 0.00774944 23 16 47763 47747 8.10832 9.10156 0.00582818 0.00770599 24 16 50079 50063 8.14738 9.04688 
0.0051125 0.00766894 25 16 52477 52461 8.19614 9.36719 0.00537575 0.00762343 26 16 54895 54879 8.24415 9.44531 0.00573134 0.00757909 27 16 57276 57260 8.28325 9.30078 0.00576683 0.00754383 28 16 59487 59471 8.29585 8.63672 0.00651535 0.00753232 29 16 61948 61932 8.34125 9.61328 0.00499461 0.00749048 30 16 64289 64273 8.36799 9.14453 0.00735917 0.00746708 31 16 66645 666298.3949 9.20312 0.00644432 0.00744233 32 16 68926 68910 8.41098 8.91016 0.00545702 0.0074289 33 16 71257 71241 8.432 9.10547 0.00505016 0.00741037 34 16 73668 73652 8.460
Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks
I knew it. FW updates are very important for SSDs On Sat, Feb 23, 2019 at 8:35 PM Michel Raabe wrote: > On Monday, February 18, 2019 16:44 CET, David Turner < > drakonst...@gmail.com> wrote: > > Has anyone else come across this issue before? Our current theory is > that > > Bluestore is accessing the disk in a way that is triggering a bug in the > > older firmware version that isn't triggered by more traditional > > filesystems. We have a scheduled call with Intel to discuss this, but > > their preliminary searches into the bugfixes and known problems between > > firmware versions didn't indicate the bug that we triggered. It would be > > good to have some more information about what those differences for disk > > accessing might be to hopefully get a better answer from them as to what > > the problem is. > > > > > > [1] > > > https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html > > Yes and no. We got a same issue with the P4500 4TB. 3 disks in one day. > In the end it was a firmware bug. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Configuration about using nvme SSD
I saw Intel had a demo of a luminous cluster running on top of the line hardware, they used 2 OSD partitions with the best performance. I was interested that they would split them like that, and asked the demo person how they came to that number. I never got a really good answer except that it would provide better performance. So I guess this must be why. On Mon, Feb 25, 2019 at 8:30 PM wrote: > I create 2-4 RBD images sized 10GB or more with --thick-provision, then > run > > fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 > -rw=randwrite -pool=rpool -runtime=60 -rbdname=testimg > > For each of them at the same time. > > > How do you test what total 4Kb random write iops (RBD) you have? > > > > -Original Message- > > From: Vitaliy Filippov [mailto:vita...@yourcmc.ru] > > Sent: 24 February 2019 17:39 > > To: David Turner > > Cc: ceph-users; 韦皓诚 > > Subject: *SPAM* Re: [ceph-users] Configuration about using nvme > > SSD > > > > I've tried 4x OSD on fast SAS SSDs in a test setup with only 2 such > > drives in cluster - it increased CPU consumption a lot, but total 4Kb > > random write iops (RBD) only went from ~11000 to ~22000. So it was 2x > > increase, but at a huge cost. > > > >> One thing that's worked for me to get more out of nvmes with Ceph is > >> to create multiple partitions on the nvme with an osd on each > > partition. > >> That > >> way you get more osd processes and CPU per nvme device. I've heard of > >> people using up to 4 partitions like this. > > > > -- > > With best regards, > >Vitaliy Filippov > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
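For the record, splitting an NVMe into several OSDs no longer requires manual partitioning; newer ceph-volume releases can do it directly, roughly like this (device names are placeholders, and the flag needs a ceph-volume version that supports it):

# two OSDs per NVMe device, bluestore by default
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1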
Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks
We had several postgresql servers running these disks from Dell. Numerous failures, including one server that had 3 die at once. Dell claims it is a firmware issue instructed us to upgrade to QDV1DP15 from QDV1DP12 (I am not sure how these line up to the Intel firmwares). We lost several more during the upgrade process. We are using ZFS with these drives. I can confirm it is not a Ceph Bluestore only issue. On Mon, Feb 18, 2019 at 8:44 AM David Turner wrote: > We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk > (partitioned), 3 disks per node, 5 nodes per cluster. The clusters are > 12.2.4 running CephFS and RBDs. So in total we have 15 NVMe's per cluster > and 30 NVMe's in total. They were all built at the same time and were > running firmware version QDV10130. On this firmware version we early on > had 2 disks failures, a few months later we had 1 more, and then a month > after that (just a few weeks ago) we had 7 disk failures in 1 week. > > The failures are such that the disk is no longer visible to the OS. This > holds true beyond server reboots as well as placing the failed disks into a > new server. With a firmware upgrade tool we got an error that pretty much > said there's no way to get data back and to RMA the disk. We upgraded all > of our remaining disks' firmware to QDV101D1 and haven't had any problems > since then. Most of our failures happened while rebalancing the cluster > after replacing dead disks and we tested rigorously around that use case > after upgrading the firmware. This firmware version seems to have resolved > whatever the problem was. > > We have about 100 more of these scattered among database servers and other > servers that have never had this problem while running the > QDV10130 firmware as well as firmwares between this one and the one we > upgraded to. Bluestore on Ceph is the only use case we've had so far with > this sort of failure. > > Has anyone else come across this issue before? Our current theory is that > Bluestore is accessing the disk in a way that is triggering a bug in the > older firmware version that isn't triggered by more traditional > filesystems. We have a scheduled call with Intel to discuss this, but > their preliminary searches into the bugfixes and known problems between > firmware versions didn't indicate the bug that we triggered. It would be > good to have some more information about what those differences for disk > accessing might be to hopefully get a better answer from them as to what > the problem is. > > > [1] > https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Questions about rbd-mirror and clones
On Tue, Feb 26, 2019 at 7:49 PM Anthony D'Atri wrote: > > Hello again. > > I have a couple of questions about rbd-mirror that I'm hoping you can help me > with. > > > 1) http://docs.ceph.com/docs/mimic/rbd/rbd-snapshot/ indicates that > protecting is required for cloning. We somehow had the notion that this had > been / will be done away with, but don't remember where we saw that. > Thoughts? By default, if the cluster is configured to to require mimic or later clients, you no longer need to protect/unprotect snapshots prior to cloning [1]. The documentation still talks about protecting/unprotecting snapshots since the new clone v2 format isn't currently enabled by default in order to preserve backwards compatibility to older librbd/krbd clients. Once we no longer support upgrading from pre-Mimic releases, we can enable clone v2 by default and start deprecating snapshot protect/unprotect features. > 2) We're currently running 12.2.2 on our cluster nodes, with rbd-mirror > running in a container built against 12.2.8. Should we expect images with > clones / parents to successfully migrate with rbd-mirror? I've had a few rude > awakenings here where I've flattened to remove the dependency, but in the > general case would rather not have to sacrifice the underlying capacity. Yes, thinly provisioned cloned images have always been supported with RBD mirroring (Jewel release). You do, however, need to ensure that the parent image has mirroring enabled. > > > > Context: We aren't using rbd-mirror for DR, we're using it to move volumes > between clusters for capacity management. > > Hope to see you at Cephalocon. > > > > > Anthony D'Atri > Storage Engineer > 425-343-5133 > ada...@digitalocean.com > > We're Hiring! | @digitalocean | linkedin > [1] https://ceph.com/community/new-mimic-simplified-rbd-image-cloning/ -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
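A short sketch of the two points above, i.e. opting in to clone v2 behaviour and making sure parent and clone are both mirrored. Pool and image names are placeholders, and the pool is assumed to be in per-image mirroring mode:

# allow clone v2 semantics (no protect/unprotect) once every client is mimic or newer
ceph osd set-require-min-compat-client mimic

# journaling is required for rbd-mirror; enable it and mirroring on parent and clone
rbd feature enable rbd/parent-image journaling
rbd mirror image enable rbd/parent-image
rbd feature enable rbd/cloned-image journaling
rbd mirror image enable rbd/cloned-image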
[ceph-users] luminous 12.2.11 on debian 9 requires nscd?
Hi all, I cannot get my luminous 12.2.11 mds servers to start on Debian 9(.8) unless nscd is also installed. Trying to start from command line: # /usr/bin/ceph-mds -f --cluster ceph --id mds02.hep.wisc.edu --setuser ceph --setgroup ceph unable to look up group 'ceph': (34) Numerical result out of range Can look up ceph fine with 'id' # id ceph uid=11(ceph) gid=11(ceph) groups=11(ceph) If I strace, I notice that an nscd directory makes an appearance: [...] open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3 lseek(3, 0, SEEK_CUR) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=285846, ...}) = 0 mmap(NULL, 285846, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970ed2000 lseek(3, 285846, SEEK_SET) = 285846 munmap(0x7f5970ed2000, 285846) = 0 close(3)= 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(3)= 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(3)= 0 open("/etc/group", O_RDONLY|O_CLOEXEC) = 3 lseek(3, 0, SEEK_CUR) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=122355, ...}) = 0 mmap(NULL, 122355, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970efa000 lseek(3, 122355, SEEK_SET) = 122355 lseek(3, 7495, SEEK_SET)= 7495 munmap(0x7f5970efa000, 122355) = 0 close(3)= 0 write(2, "unable to look up group '", 25unable to look up group ') = 25 write(2, "ceph", 4ceph) = 4 write(2, "'", 1')= 1 write(2, ": ", 2: ) = 2 write(2, "(34) Numerical result out of ran"..., 34(34) Numerical result out of range) = 34 write(2, "\n", 1 So I install nscd and mds starts! Shouldn't ceph be agnostic in how the ceph group is looked up? Do I have some kind of config problem? My nsswitch.conf file is below. I've tried replacing 'compat' with files, but there is no change. # cat /etc/nsswitch.conf # /etc/nsswitch.conf # # Example configuration of GNU Name Service Switch functionality. # If you have the `glibc-doc-reference' and `info' packages installed, try: # `info libc "Name Service Switch"' for information about this file. passwd: compat group: compat shadow: compat gshadow:files hosts: files dns networks: files protocols: db files services: db files ethers: db files rpc:db files netgroup: nis Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
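One way to reproduce the lookup outside of ceph-mds, since getent goes through the same NSS configuration as the getgrnam() call that is failing here:

# resolve the group through NSS, as libc would
getent group ceph

# watch which files and sockets the lookup touches
strace -e trace=open,openat,connect getent group ceph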
Re: [ceph-users] Mimic and cephfs
I've been using a fresh 13.2.2 install in production for 4 months now without any issues.

February 25, 2019 10:17 PM, "Andras Pataki" wrote:

> Hi ceph users,
>
> As I understand, cephfs in Mimic had significant issues up to and
> including version 13.2.2. With some critical patches in Mimic 13.2.4,
> is cephfs now production quality in Mimic? Are there folks out there
> using it in a production setting? If so, could you share your
> experience with it (as compared to Luminous)?
>
> Thanks,
>
> Andras
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com