[ceph-users] Mon crashes virtual void LogMonitor::update_from_paxos(bool*)

2020-01-15 Thread Kevin Hrpcek
4) [0x55e5a0326784] 7: (main()+0x2611) [0x55e5a021b2b1] 8: (__libc_start_main()+0xf5) [0x7f09397c6505] 9: (()+0x24ad40) [0x55e5a02fad40] -261> 2020-01-15 16:36:46.086 7f0946674a00 -1 *** Caught signal (Aborted) ** in thread 7f0946674a00 thread_name:ceph-mon

[ceph-users] January Ceph Science Group Virtual Meeting

2020-01-13 Thread Kevin Hrpcek
t.com/en/conference-numbers> 2.) Enter Meeting ID: 908675367 3.) Press # Want to test your video connection? https://bluejeans.com/111

[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
ting ID: 908675367 3.) Press # Want to test your video connection? https://bluejeans.com/111 -- Kevin Hrpcek NASA VIIRS Atmosphere SIPS Space Science & Engineerin

Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
cheduled the next meeting on the community calendar for August 28 at 14:30 UTC. Each meeting will then take place on the last Wednesday of each month. Here's the pad to collect agenda/notes: https://pad.ceph.com/p/Ceph_Science_User_Group_Index -- Mike Perez (thingee) On Tue, Jul 23, 2019

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
rush weight? Or the reweight? (I guess you change the crush weight, am I right?) Thanks! On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote: I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, you can obviously chang

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, you can obviously change the weight increase steps to what you are comfortable with. This has worked well for me and my workloads. I've sometimes seen peering take longer if I do steps too quickly but I don't run any
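The stepped-weighting approach described above can be scripted. Below is a minimal, hypothetical sketch of that idea driven from Python; the step size, the OSD ids and target weights, and the PG states treated as "still settling" are illustrative assumptions, not details from the original mail.

```python
#!/usr/bin/env python3
"""Sketch only: gradually crush-reweight new OSDs, pausing between steps."""
import json
import subprocess
import time

NEW_OSDS = {"osd.100": 9.09470, "osd.101": 9.09470}   # assumed ids -> target crush weights
STEP = 1.0                                            # weight added per round (illustrative)
SETTLE_STATES = ("peering", "activating")             # states to wait out between steps


def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.check_output(("ceph",) + args, text=True)


def pgs_in_states(states):
    """Count PGs currently in any of the given states, from `ceph -s --format json`."""
    status = json.loads(ceph("-s", "--format", "json"))
    return sum(s["count"]
               for s in status["pgmap"].get("pgs_by_state", [])
               if any(st in s["state_name"] for st in states))


current = {osd: 0.0 for osd in NEW_OSDS}
while any(current[o] < NEW_OSDS[o] for o in NEW_OSDS):
    for osd, target in NEW_OSDS.items():
        if current[osd] < target:
            current[osd] = min(current[osd] + STEP, target)
            ceph("osd", "crush", "reweight", osd, str(current[osd]))
    # let peering finish before the next weight bump
    time.sleep(30)
    while pgs_in_states(SETTLE_STATES):
        time.sleep(10)
```

In practice you would also watch backfill/recovery load between steps, as the original mail suggests, rather than only waiting for peering to clear.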

Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-23 Thread Kevin Hrpcek
Update: We're going to hold off until August for this so we can promote it on the Ceph twitter with more notice. Sorry for the inconvenience if you were planning on the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for updates. Kevin On 7/5/19 11:15 PM, Kevin H

Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-05 Thread Kevin Hrpcek
olunteer a topic for meetings. I will be brainstorming some conversation starters but it would also be interesting to have people give a deep dive into their use of ceph and what they have built around it to support the science being done at their facility. Kevin On 6/17/19 10:43 AM, Ke

[ceph-users] Ceph Scientific Computing User Group

2019-06-17 Thread Kevin Hrpcek
Hey all, At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together mont

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Kevin Hrpcek
lates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code it's coming up with on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown se

Re: [ceph-users] Mimic upgrade failure

2018-09-12 Thread Kevin Hrpcek
'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote: Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier aft

Re: [ceph-users] Mimic upgrade failure

2018-09-09 Thread Kevin Hrpcek
seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy

Re: [ceph-users] Mimic upgrade failure

2018-09-09 Thread Kevin Hrpcek
back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening? O

Re: [ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek
age On Sat, 8 Sep 2018, Kevin Hrpcek wrote: Hello, I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cl

[ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek
Hello, I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it since

Re: [ceph-users] separate monitoring node

2018-06-20 Thread Kevin Hrpcek
'd set up the VM with ceph-common, the conf, and a restricted keyring, then have icinga2 run an nrpe check on it that calls the check_ceph, ceph -s, or whatever. Kevin On 06/19/2018 04:13 PM, Denny Fuchs wrote: hi, On 19.06.2018 at 17:17, Kevin Hrpcek wrote: # ceph auth get client.icinga e

Re: [ceph-users] separate monitoring node

2018-06-19 Thread Kevin Hrpcek
I use icinga2 as well with a check_ceph.py that I wrote a couple years ago. The method I use is that icinga2 runs the check from the icinga2 host itself. ceph-common is installed on the icinga2 host since the check_ceph script is a wrapper and parser for the ceph command output using python's s
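As a rough illustration of the wrapper-and-parser pattern described above (this is not the actual check_ceph.py), a minimal Nagios/Icinga-style check could look like the sketch below; the client.icinga user name and keyring path are assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios/Icinga-style Ceph health check: run the ceph CLI with a
restricted keyring, parse the JSON status, and map health to Nagios exit codes."""
import json
import subprocess
import sys

CMD = ["ceph", "--id", "icinga",
       "--keyring", "/etc/ceph/ceph.client.icinga.keyring",
       "status", "--format", "json"]
EXIT = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}  # Nagios OK/WARNING/CRITICAL

try:
    status = json.loads(subprocess.check_output(CMD, text=True, timeout=30))
except Exception as exc:                      # cluster unreachable, auth failure, ...
    print(f"UNKNOWN: ceph status failed: {exc}")
    sys.exit(3)

health = status["health"]["status"]           # luminous+ JSON layout
print(f"{health}: {status['pgmap'].get('num_pgs', '?')} PGs")
sys.exit(EXIT.get(health, 3))
```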

[ceph-users] Reweighting causes whole cluster to peer/activate

2018-06-14 Thread Kevin Hrpcek
Hello, I'm seeing something that seems to be odd behavior when reweighting OSDs. I've just upgraded to 12.2.5 and am adding in a new osd server to the cluster. I gradually weight the 10TB OSDs into the cluster by doing a +1, letting things backfill for a while, then +1 until I reach my desire

Re: [ceph-users] librados python pool alignment size write failures

2018-04-03 Thread Kevin Hrpcek
Thanks for the input Greg, we've submitted the patch to the ceph github repo https://github.com/ceph/ceph/pull/21222 Kevin On 04/02/2018 01:10 PM, Gregory Farnum wrote: On Mon, Apr 2, 2018 at 8:21 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote: Hello, We

[ceph-users] librados python pool alignment size write failures

2018-04-02 Thread Kevin Hrpcek
re just doing things different compared to most users... Any insight would be appreciated as we'd prefer to use an official solution rather than our bindings fix for long term use. Tested on Luminous 12.2.2 and 12.2.4. Thanks, Kevin -- Kevin Hrpcek Linux Systems Administrator NASA SNPP
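For context on the alignment constraint being discussed (the actual fix is the pull request linked in the reply above), here is a minimal, hypothetical sketch using the Python rados bindings: on a pool that requires alignment (e.g. erasure-coded), appended chunks must be multiples of the pool's required alignment, so unaligned data is zero-padded. The pool name and the hardcoded ALIGNMENT value are assumptions; real code would query the required alignment from librados and track the logical data length separately from the padding.

```python
#!/usr/bin/env python3
"""Sketch only: pad appends to an assumed pool alignment before writing."""
import rados

ALIGNMENT = 8192          # assumed required alignment of the target pool, in bytes
POOL = "ec_pool"          # assumed pool name


def append_aligned(ioctx, name, data, alignment=ALIGNMENT):
    """Append data in chunks whose sizes are multiples of the pool alignment."""
    for start in range(0, len(data), alignment):
        chunk = data[start:start + alignment]
        if len(chunk) % alignment:
            chunk = chunk.ljust(alignment, b"\0")   # zero-pad the final chunk
        ioctx.append(name, chunk)


cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        append_aligned(ioctx, "test_object", b"some payload that is not aligned")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```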

Re: [ceph-users] Ceph luminous - throughput performance issue

2018-01-31 Thread Kevin Hrpcek
Steven, I've recently done some performance testing on dell hardware. Here are some of my messy results. I was mainly testing the effects of the R0 stripe sizing on the perc card. Each disk has its own R0 so that write back is enabled. VDs were created like this but with different stripesize

Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Kevin Hrpcek
Marc, If you're running luminous you may need to increase osd_max_object_size. This snippet is from the Luminous change log. "The default maximum size for a single RADOS object has been reduced from 100GB to 128MB. The 100GB limit was completely impractical in practice while the 128MB limit
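As a hedged example of raising that limit on a running Luminous cluster (the 1 GiB value is purely illustrative, the setting should also be persisted in ceph.conf under [osd], and OSDs may report that a restart is needed for it to take effect):

```python
# Hypothetical example only: bump osd_max_object_size on all running OSDs.
import subprocess

subprocess.check_call([
    "ceph", "tell", "osd.*", "injectargs",
    "--osd_max_object_size 1073741824",   # 1 GiB, passed as one argument like the quoted shell form
])
```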

Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-06 Thread Kevin Hrpcek
quickly setting nodown,noout,noup when everything is already down will help as well. Sage, thanks again for your input and advice. Kevin On 11/04/2017 11:54 PM, Sage Weil wrote: On Sat, 4 Nov 2017, Kevin Hrpcek wrote: Hey Sage, Thanks for getting back to me this late on a weekend. Do you
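A minimal sketch of that flag-setting step (the flag list and the idea of driving it from Python are only illustrative; the shell equivalent is `ceph osd set <flag>` / `ceph osd unset <flag>`):

```python
# Sketch: freeze OSD state changes while recovering a down cluster,
# then clear the flags once the cluster has stabilized.
import subprocess

FLAGS = ["nodown", "noout", "noup"]

for flag in FLAGS:
    subprocess.check_call(["ceph", "osd", "set", flag])

# ... restart OSDs / let the mons and peering settle ...

for flag in FLAGS:
    subprocess.check_call(["ceph", "osd", "unset", flag])
```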

Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek
Hey Sage, Thanks for getting back to me this late on a weekend. Do you know why the OSDs were going down? Are there any crash dumps in the osd logs, or is the OOM killer getting them? That's a part I can't nail down yet. OSDs didn't crash; after the reweight-by-utilization, OSDs on some of our e

[ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek
backfill_wait 2 stale+activating 1 stale+active+clean+scrubbing 1 active+recovering+undersized+degraded 1 stale+active+remapped+backfilling 1 inactive 1 active+clean+scrubbing