Yesterday I went through manually configuring a ceph cluster with a
rados gateway on centos 6.5, and I have a question about the
documentation. On this page:
https://ceph.com/docs/master/radosgw/config/
It mentions "On CentOS/RHEL distributions, turn off print continue. If
you have it set to tru
> Any help would be greatly appreciated.
>
>
wrote:
> Bryan,
>
> Good explanation. How's performance now that you've spread the load over
> multiple buckets?
>
> Mark
>
> On 09/04/2013 12:39 PM, Bryan Stillwell wrote:
>
>> Bill,
>>
>> I've run into a similar issue with objects averagi
>>> Bryan
>>>
>>>
>>> On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson
>>> <mark.nel...@inktank.com> wrote:
>>>
>>> Bryan,
>>>
>>> Good explanation. How's performance now that you'v
that things have slowed down a bit.
The average upload rate over those first 20 hours was ~48
objects/second, but now I'm only seeing ~20 objects/second. This is
with 18,836 buckets.
Bryan
On Wed, Sep 4, 2013 at 12:43 PM, Bryan Stillwell
wrote:
> So far I haven't seen much of a c
but they are not the
> easiest tools to setup/use).
>
> Mark
>
> On 09/05/2013 11:59 AM, Bryan Stillwell wrote:
>>
>> Mark,
>>
>> Yesterday I blew away all the objects and restarted my test using
>> multiple buckets, and things are definitely better!
This appears to be more of an XFS issue than a ceph issue, but I've
run into a problem where some of my OSDs failed because the filesystem
was reported as full even though there was 29% free:
[root@den2ceph001 ceph-1]# touch blah
touch: cannot touch `blah': No space left on device
[root@den2ceph001 ceph-1]# xfs_db -c frag -r /dev/sdc1
actual 3481543, ideal 3447443, fragmentation factor 0.98%
Bryan
On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe wrote:
>
> How fragmented is that file system?
>
> Sent from my iPad
>
> > On Oct 14, 2013, at 5:44 PM, Bryan Stillwell
> > wrote:
> >
>
What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
would help? I'm pretty sure this would require re-running
mkfs.xfs on every OSD to fix if that's the case...
Thanks,
Bryan
On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
wrote:
>
> The filesystem isn't
[osd]
osd_mount_options_xfs = "rw,noatime,inode64"
osd_mkfs_options_xfs = "-f -b size=2048"
The cluster is currently running the 0.71 release.
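As a quick sanity check (assuming the OSD data directory is mounted at
/var/lib/ceph/osd/ceph-1), xfs_info will confirm whether the 2K block
size actually took effect:
# xfs_info /var/lib/ceph/osd/ceph-1 | grep bsize
The data line should report bsize=2048 if the mkfs options above were used.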
Bryan
On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
wrote:
> So I'm running into this issue again and after spending a bit
9
>
> ____
> From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com]
> on behalf of Bryan Stillwell [bstillw...@photobucket.com]
> Sent: Wednesday, October 30, 2013 2:18 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-use
free blocks 91693664
average free extent size 44.7352
That gives me a little more confidence in using 2K block sizes now. :)
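For reference, the summary above comes from xfs_db's freesp command;
assuming the OSD device is /dev/sdc1, the invocation looks like:
# xfs_db -r -c "freesp -s" /dev/sdc1
where -r opens the filesystem read-only.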
Bryan
On Thu, Oct 31, 2013 at 11:02 AM, Bryan Stillwell
wrote:
> Shain,
>
> After getting the segfaults when running 'xfs_db -r "-c freesp -s"'
While updating my cluster to use a 2K block size for XFS, I've run
into a couple OSDs failing to start because of corrupted journals:
=== osd.1 ===
-10> 2013-11-12 13:40:35.388177 7f030458a7a0 1
filestore(/var/lib/ceph/osd/ceph-1) mount detected xfs
-9> 2013-11-12 13:40:35.388194 7f030458a
On Tue, Mar 5, 2013 at 12:44 PM, Kevin Decherf wrote:
>
> On Tue, Mar 05, 2013 at 12:27:04PM -0600, Dino Yancey wrote:
> > The only two features I'd deem necessary for our workload would be
> > stable distributed metadata / MDS and a working fsck equivalent.
> > Snapshots would be great once the f
I have two test clusters running Bobtail (0.56.4) and Ubuntu Precise
(12.04.2). The problem I'm having is that I'm not able to get either
of them into a state where I can both mount the filesystem and have
all the PGs in the active+clean state.
It seems that on both clusters I can get them into a
1 PM, John Wilkins wrote:
> Bryan,
>
> It seems you got crickets with this question. Did you get any further? I'd
> like to add it to my upcoming CRUSH troubleshooting section.
>
>
> On Wed, Apr 3, 2013 at 9:27 AM, Bryan Stillwell <
> bstillw...@photobucket.com> wrote:
//ceph.com
>
>
> On Thu, Apr 18, 2013 at 12:51 PM, John Wilkins
> wrote:
> > Bryan,
> >
> > It seems you got crickets with this question. Did you get any further?
> I'd
> > like to add it to my upcoming CRUSH troubleshooting section.
> >
the tunables. In setups where your branching factors aren't very close to
> your replication counts they aren't normally needed, if you want to reshape
> your cluster a little bit.
> -Greg
>
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On
I've run into an issue where after copying a file to my cephfs cluster
the md5sums no longer match. I believe I've tracked it down to some
parts of the file which are missing:
$ obj_name=$(cephfs "title1.mkv" show_location -l 0 | grep object_name
| sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
$
:
> On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
> wrote:
>> I've run into an issue where after copying a file to my cephfs cluster
>> the md5sums no longer match. I believe I've tracked it down to some
>> parts of the file which are missing:
>>
>
hout the debugfs stuff
> being enabled. :/
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell
> wrote:
>> I've tried a few different ones:
>>
>> 1. cp to cephfs mounted filesystem on
e, Apr 23, 2013 at 4:41 PM, Gregory Farnum wrote:
> On Tue, Apr 23, 2013 at 3:37 PM, Bryan Stillwell
> wrote:
>> I'm using the kernel client that's built into precise & quantal.
>>
>> I could give the ceph-fuse client a try and see if it has the same
>>
On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil wrote:
>
> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
> > I'm testing this now, but while going through the logs I saw something
> > that might have something to do with this:
> >
> > Apr 23 16:35:28 a1 kernel: [
On Tue, Apr 23, 2013 at 5:45 PM, Sage Weil wrote:
> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>> On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil wrote:
>> >
>> > On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>> > > I'm testing this now, but while going
On Tue, Apr 23, 2013 at 5:54 PM, Gregory Farnum wrote:
> On Tue, Apr 23, 2013 at 4:45 PM, Sage Weil wrote:
>> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>>> On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil wrote:
>>> >
>>> > On Tue, 23 Apr 2013, Bryan Sti
With the release of cuttlefish, I decided to try out ceph-deploy and
ran into some documentation errors along the way:
http://ceph.com/docs/master/rados/deployment/preflight-checklist/
Under 'CREATE A USER' it has the following line:
To provide full privileges to the user, add the following to
I attempted to upgrade my bobtail cluster to cuttlefish tonight and I
believe I'm running into some mon related issues. I did the original
install manually instead of with mkcephfs or ceph-deploy, so I think
that might have to do with this error:
root@a1:~# ceph-mon -d -c /etc/ceph/ceph.conf
2013
On Thu, May 23, 2013 at 9:58 AM, Smart Weblications GmbH - Florian
Wiessner wrote:
> you may need to update your [mon.a] section in your ceph.conf like this:
>
>
> [mon.a]
>    mon data = /var/lib/ceph/mon/ceph-a/
That didn't seem to make a difference; it kept trying to use ceph-admin.
I tri
Shortly after upgrading from bobtail to cuttlefish I tried increasing
the number of monitors in my small test cluster from 1 to 3, but I
believe I messed something up in the process. At first I thought the
conversion to leveldb failed, but after digging into it a bit I
believe this explains it:
#
"a",
"addr": "172.24.88.50:6789\/0"},
{ "rank": 1,
"name": "mon.b",
"addr": "172.24.88.53:6789\/0"}]}}
Any ideas how to get rid of mon.b?
Thanks,
Bryan
On
I have a cluster I originally built on argonaut and have since
upgraded it to bobtail and then cuttlefish. I originally configured
it with one node for both the mds node and mon node, and 4 other nodes
for hosting osd's:
a1: mon.a/mds.a
b1: osd.0, osd.1, osd.2, osd.3, osd.4, osd.20
b2: osd.5, osd
On Tue, Jun 11, 2013 at 3:50 PM, Gregory Farnum wrote:
> You should not run more than one active MDS (less stable than a
> single-MDS configuration, bla bla bla), but you can run multiple
> daemons and let the extras serve as a backup in case of failure. The
> process for moving an MDS is pretty e
I'm in the process of cleaning up a test that an internal customer did on our
production cluster that produced over a billion objects spread across 6000
buckets. So far I've been removing the buckets like this:
printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm
--buck
Wouldn't doing it that way cause problems since references to the objects
wouldn't be getting removed from .rgw.buckets.index?
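For anyone stuck with the normal gc path, a rough way to gauge the backlog
(assuming a radosgw-admin new enough to support --include-all) is:
# radosgw-admin gc list --include-all | grep -c '"oid"'
which approximates the number of objects still waiting for collection, and
'radosgw-admin gc process' starts a collection pass immediately instead of
waiting for the next scheduled run.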
Bryan
From: Roger Brown
Date: Monday, July 24, 2017 at 2:43 PM
To: Bryan Stillwell , "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users]
nable amount of time.
Thanks,
Bryan
From: Pavan Rallabhandi
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Speeding up garbage collection in RGW
If your Ceph version is >=Jewel, you can try the `--bypass-gc` opti
Excellent, thank you! It does exist in 0.94.10! :)
Bryan
From: Pavan Rallabhandi
Date: Tuesday, July 25, 2017 at 11:21 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Speeding up garbage collection in RGW
I’ve just realized that the option is
Dan,
We recently went through an expansion of an RGW cluster and found that we
needed 'norebalance' set whenever making CRUSH weight changes to avoid slow
requests. We were also increasing the CRUSH weight by 1.0 each time which
seemed to reduce the extra data movement we were seeing with smal
I was reading this post by Josh Durgin today and was pretty happy to see we can
get a summary of features that clients are using with the 'ceph features'
command:
http://ceph.com/community/new-luminous-upgrade-complete/
However, I haven't found an option to display the IP address of those clients.
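The summary itself comes from 'ceph features'; for per-client details one
option is the mon admin socket (assuming the mon id matches the short
hostname):
# ceph daemon mon.$(hostname -s) sessions
which lists each connected session along with its address and the feature
bits it advertises.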
On 09/07/2017 10:47 AM, Josh Durgin wrote:
> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
> > I was reading this post by Josh Durgin today and was pretty happy to
> > see we can get a summary of features that clients are using with the
> > 'ceph features' c
For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters. This started
happening a day after we started weighing in some new OSD nodes, so
we're thinking it's probably related to that. Could someone help us
determine the root
On 09/07/2017 01:26 PM, Josh Durgin wrote:
> On 09/07/2017 11:31 AM, Bryan Stillwell wrote:
>> On 09/07/2017 10:47 AM, Josh Durgin wrote:
>>> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
>>>> I was reading this post by Josh Durgin today and was pretty happy to
>
lf of Bryan
Stillwell
Date: Friday, September 8, 2017 at 9:26 AM
To: ceph-users
Subject: [ceph-users] radosgw crashing after buffer overflows detected
For
e --bypass-gc option to avoid the cleanup, but is there a
way to speed up the gc once you're in this position? There were about 8M
objects that were deleted from this bucket. I've come across a few references
to the rgw-gc settings in the config, but nothing that explained the times w
Bryan
From: Yehuda Sadeh-Weinraub
Date: Wednesday, October 25, 2017 at 11:32 AM
To: Bryan Stillwell
Cc: David Turner , Ben Hines ,
"ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Speeding up garbage collection in RGW
Some of the options there won't do much for you as they
On Wed, Oct 25, 2017 at 4:02 PM, Yehuda Sadeh-Weinraub
wrote:
>
> On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell
> wrote:
> > That helps a little bit, but overall the process would take years at this
> > rate:
> >
> > # for i in {1..3600}; do ceph df -f jso
As mentioned in another thread I'm trying to remove several thousand buckets on
a hammer cluster (0.94.10), but I'm running into a problem using --bypass-gc.
I usually see either this error:
# radosgw-admin bucket rm --bucket=sg2pl598 --purge-objects --bypass-gc
2017-10-31 09:21:04.111599 7f45f5
We're looking into switching the failure domains on several of our
clusters from host-level to rack-level and I'm trying to figure out the
least impactful way to accomplish this.
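The mechanical part is simple enough; a rough sketch on Luminous or later,
with the rule and pool names below as placeholders:
# ceph osd crush rule create-replicated rack-rule default rack
# ceph osd set norebalance
# ceph osd pool set volumes crush_rule rack-rule
# ceph osd unset norebalance
The harder part is pacing the data movement that follows, which is what I'm
trying to minimize.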
First off, I've made this change before on a couple large (500+ OSDs)
OpenStack clusters where the volumes, images, and
Bryan,
Based off the information you've provided so far, I would say that your largest
pool still doesn't have enough PGs.
If you originally had only 512 PGs for your largest pool (I'm guessing
.rgw.buckets has 99% of your data), then on a balanced cluster you would have
just ~11.5 PGs per OSD
It may work fine, but I would suggest limiting the number of operations going
on at the same time.
Bryan
From: Bryan Banister
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell , Janne Johansson
Cc: Ceph Users
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
I decided to upgrade my home cluster from Luminous (v12.2.7) to Mimic (v13.2.1)
today and ran into a couple issues:
1. When restarting the OSDs during the upgrade it seems to forget my upmap
settings. I had to manually return them to the way they were with commands
like:
ceph osd pg-upmap-items
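For reference, the full syntax takes a PG id followed by from/to OSD pairs,
e.g. (placeholder ids):
ceph osd pg-upmap-items 1.7 3 5
which remaps PG 1.7 from osd.3 to osd.5.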
> On 08/30/2018 11:00 AM, Joao Eduardo Luis wrote:
> > On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > Hi,
> > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > eventually triggering the 'mon is using a lot of disk space' warning?
> > Since upgrading to luminous, we've seen
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a
problem where the new ceph-mgr would sometimes hang indefinitely when doing
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest of
our clusters (10+) aren't seeing the same issue, but they ar
I left some of the 'ceph pg dump' commands running and twice they returned
results after 30 minutes, and three times it took 45 minutes. Is there
something that runs every 15 minutes that would let these commands finish?
Bryan
From: Bryan Stillwell
Date: Thursday, October 18, 201
t. Anyone know the reasoning
for that decision?
Bryan
From: Dan van der Ster
Date: Thursday, October 18, 2018 at 2:03 PM
To: Bryan Stillwell
Cc: ceph-users
Subject: Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous
15 minutes seems like the ms tcp read timeout would be rel
collectd which is running
'ceph pg dump' every 16-17 seconds. I guess you could say we're stress testing
that code path fairly well... :)
Bryan
On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell
<bstillw...@godaddy.com> wrote:
After we upgraded from Jewel (10.
[mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 3:21 PM
To: Bryan Banister <bbanis...@jumptrading.com>
Cc: Bryan Stillwell <bstillw...@godaddy.com>;
Janne Johansson <icepic...@gmail.com>; Ceph Users
<ceph-users@lists.ceph.com>
We recently began our upgrade testing for going from Jewel (10.2.10) to
Luminous (12.2.5) on our clusters. The first part of the upgrade went
pretty smoothly (upgrading the mon nodes, adding the mgr nodes, upgrading
the OSD nodes), however, when we got to the RGWs we started seeing internal
server
> We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB disks
> each to the cluster. The first 5 nodes rebalanced well without any issues, but
> the OSDs on the sixth/last node started acting weird: as I increase the weight of one
> osd, the utilization doesn't change but a different osd on the s
This has come up quite a few times before, but since I was only working with
RBD before I didn't pay too close attention to the conversation. I'm looking
for the best way to handle existing clusters that have buckets with a large
number of objects (>20 million) in them. The cluster I'm doing test
Is this on an RGW cluster?
If so, you might be running into the same problem I was seeing with large
bucket sizes:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html
The solution is to shard your buckets so the bucket index doesn't get too big.
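On releases that include the reshard subcommand (roughly Jewel 10.2.5 and
later), an existing bucket can be resharded offline; the bucket name and
shard count below are placeholders:
# radosgw-admin bucket reshard --bucket=mybucket --num-shards=128
For new buckets, rgw_override_bucket_index_max_shards in ceph.conf sets the
default number of index shards.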
Bryan
From: ceph-users o
I have a cluster running 10.2.7 that is seeing some extremely large directory
sizes in CephFS according to the recursive stats:
$ ls -lhd Originals/
drwxrwxr-x 1 bryan bryan 16E Jun 13 13:27 Originals/
du reports a much smaller (and accurate) number:
$ du -sh Originals/
300G    Originals/
This
On 6/15/17, 9:20 AM, "John Spray" wrote:
>
> On Wed, Jun 14, 2017 at 4:31 PM, Bryan Stillwell
> wrote:
> > I have a cluster running 10.2.7 that is seeing some extremely large
> > directory sizes in CephFS according to the recursive stats:
> >
> >
Wido,
I've been looking into this large omap objects problem on a couple of our
clusters today and came across your script during my research.
The script has been running for a few hours now and I'm already over 100,000
'orphaned' objects!
It appears that ever since upgrading to Luminous (12.2
Recently on one of our bigger clusters (~1,900 OSDs) running Luminous (12.2.8),
we had a problem where OSDs would frequently get restarted while deep-scrubbing.
After digging into it I found that a number of the OSDs had very large omap
directories (50GiB+). I believe these were OSDs that had p
f Zelenka
Date: Thursday, January 3, 2019 at 3:49 AM
To: "J. Eric Ivancich"
Cc: "ceph-users@lists.ceph.com" , Bryan Stillwell
Subject: Re: [ceph-users] Omap issues - metadata creating too many
Hi, i had the default - so it was on(according to ceph kb). turned it
off, but the iss
I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't
cleaning up old osdmaps after doing an expansion. This is even after the
cluster became 100% active+clean:
# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181
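Each OSD also reports the range of maps it is holding, which makes it easy
to watch whether trimming ever kicks in (same OSD as above):
# ceph daemon osd.1754 status
The gap between oldest_map and newest_map in that output should shrink once
the mons start trimming again.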
With the osdmaps being over 600KB i
I believe the option you're looking for is mon_data_size_warn. The default is
set to 16106127360.
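To see how close a given mon is to that threshold (default data path and a
mon id of 'a' assumed):
# du -sh /var/lib/ceph/mon/ceph-a
# ceph daemon mon.a config get mon_data_size_warn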
I've found that sometimes the mons need a little help getting started with
trimming if you just completed a large expansion. Earlier today I had a
cluster where the mon's data directory was over
solution Dan came across back in the hammer days. It
works, but not ideal for sure. Across the cluster it freed up around 50TB of
data!
Bryan
From: ceph-users on behalf of Bryan
Stillwell
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users
Subject: [ceph-users] osdmaps not being cleaned
we're seeing up to 49,272 osdmaps hanging around. The churn trick seems to be
working again too.
Bryan
From: Dan van der Ster
Date: Thursday, January 10, 2019 at 3:13 AM
To: Bryan Stillwell
Cc: ceph-users
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8
Hi Bryan,
I think th
I've created the following bug report to address this issue:
http://tracker.ceph.com/issues/37875
Bryan
From: ceph-users on behalf of Bryan
Stillwell
Date: Friday, January 11, 2019 at 8:59 AM
To: Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] osdmaps not being cleaned
I'm looking for some help in fixing a bucket index on a Luminous (12.2.8)
cluster running on FileStore.
First some background on how I believe the bucket index became broken. Last
month we had a PG in our .rgw.buckets.index pool become inconsistent:
2018-12-11 09:12:17.743983 osd.1879 osd.1879 1
This is sort of related to my email yesterday, but has anyone ever rebuilt a
bucket index using the objects themselves?
It seems to be that it would be possible since the bucket_id is contained
within the rados object name:
# rados -p .rgw.buckets.index listomapkeys .dir.default.56630221.139618
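If you need to confirm which bucket_id a bucket is currently using (the
bucket name below is a placeholder), bucket stats will show it:
# radosgw-admin bucket stats --bucket=mybucket | grep '"id"'
That id is what appears in the .dir.<bucket_id>[.<shard>] index object names.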
Since you're using jumbo frames, make sure everything between the nodes
properly supports them (nics & switches). I've tested this in the past by
using the size option in ping (you need to use a payload size of 8972 instead
of 9000 to account for the 28 byte header):
ping -s 8972 192.168.160.
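If you also want the test to fail loudly instead of ever fragmenting, -M do
forbids fragmentation (the destination below is a placeholder):
ping -M do -s 8972 192.168.160.10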
When you use 3+2 EC that means you have 3 data chunks and 2 erasure chunks for
your data. So you can handle two failures, but not three. The min_size
setting is preventing you from going below 3 because that's the number of data
chunks you specified for the pool. I'm sorry to say this, but si
I've run my home cluster with drives ranging in size from 500GB to 8TB before
and the biggest issue you run into is that the bigger drives will get a
proportionally larger number of PGs, which will increase the memory requirements on
them. Typically you want around 100 PGs/OSD, but if you mix 4TB an
I'm wondering if the 'radosgw-admin bucket check --fix' command is broken in
Luminous (12.2.8)?
I'm asking because I'm trying to reproduce a situation we have on one of our
production clusters and it doesn't seem to do anything. Here's the steps of my
test:
1. Create a bucket with 1 million o
We have two separate RGW clusters running Luminous (12.2.8) that have started
seeing an increase in PGs going active+clean+inconsistent with the reason being
caused by an omap_digest mismatch. Both clusters are using FileStore and the
inconsistent PGs are happening on the .rgw.buckets.index poo
> On Apr 8, 2019, at 4:38 PM, Gregory Farnum wrote:
>
> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell wrote:
>>
>> There doesn't appear to be any correlation between the OSDs which would
>> point to a hardware issue, and since it's happening on two di
> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell wrote:
>
>
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum wrote:
>>
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell
>> wrote:
>>>
>>> There doesn't appear to be any correlation between
On Oct 29, 2019, at 11:23 AM, Jean-Philippe Méthot
wrote:
> A few months back, we had one of our OSD node motherboards die. At the time,
> we simply waited for recovery and purged the OSDs that were on the dead node.
> We just replaced that node and added back the drives as new OSDs. At the cep
Jelle,
Try putting just the WAL on the Optane NVMe. I'm guessing your DB is too big
to fit within 5GB. We used a 5GB journal on our nodes as well, but when we
switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical
volumes (200GB SSD / 5 or 400GB SSD / 10) for our DBs.
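If you're unsure how much DB space an OSD is actually using, the BlueFS
counters give a quick read (OSD id assumed):
# ceph daemon osd.0 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'
A non-zero slow_used_bytes means the DB has already spilled over onto the
main device.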
A