We have a Kraken cluster, newly built at the time, with BlueStore enabled.
It is 8 systems, each with 10 x 10TB disks, and each server has one 2TB NVMe
disk.
3 monitors, etc.
About 700 TB in total, with 300 TB used. Mainly an S3 object store.
Of course there is more to the story: We have one strange thing in our
Hello all,
I am trying to upgrade a small test setup with one monitor and one OSD
node, which is on the Hammer release.
I updated from Hammer to Jewel using package update commands and things
are working.
However, after updating from Jewel to Luminous, I am facing issues with the OSDs
failing to start.
On Tue, Sep 12, 2017 at 3:12 PM, Katie Holly wrote:
> Ben and Brad,
>
> big thanks to both of you for helping me track down this issue which -
> seemingly - was caused by more than one radosgw instance sharing the exact
> same --name value and solved by generating unique keys and --name values f
Ben and Brad,
big thanks to both of you for helping me track down this issue which -
seemingly - was caused by more than one radosgw instance sharing the exact same
--name value and solved by generating unique keys and --name values for each
single radosgw instance.
Right now, all ceph-mgr dae
They all share the exact same exec arguments, so yes, they all have the same
--name as well. I'll try to run them with different --name parameters to see if
that solves the issue.
--
Katie
On 2017-09-12 06:13, Ben Hines wrote:
> Do the docker containers all have the same rgw --name ? Maybe that
On Tue, Sep 12, 2017 at 2:11 PM, Katie Holly wrote:
> All radosgw instances are running
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> as Docker containers, there are 15 of them at any possible time
>
>
> The "config"/exec-args for the radosgw instances are:
>
>
All radosgw instances are running
> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
as Docker containers; there are 15 of them at any given time.
The "config"/exec-args for the radosgw instances are:
/usr/bin/radosgw \
-d \
--cluster=ceph \
--conf=/dev/null
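(For reference, a minimal sketch of what per-instance exec args could look like once each radosgw gets its own identity; the client name and keyring path below are placeholders, not taken from the original setup:)
# hypothetical unique identity per container
/usr/bin/radosgw \
-d \
--cluster=ceph \
--conf=/dev/null \
--name client.rgw.host1-a \
--keyring=/etc/ceph/ceph.client.rgw.host1-a.keyring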
It seems like it's choking on the report from the rados gateway. What
version is the rgw node running?
If possible, could you shut down the rgw and see if you can then start ceph-mgr?
Pure stab in the dark just to see if the problem is tied to the rgw instance.
On Tue, Sep 12, 2017 at 1:07 PM, K
Thanks, I totally forgot to check the tracker. I added the information I
collected there, but I don't have enough experience with Ceph to dig through this
myself, so let's see if someone is willing to sacrifice their free time to help
debug this issue.
--
Katie
On 2017-09-12 03:15, Brad Hubba
Could this also have an effect on kRBD clients?
If so, what ceph auth caps command should we use?
From: ceph-users on behalf of Jason
Dillaman
Sent: 11 September 2017 22:00:47
To: Nico Schottelius
Cc: ceph-users; kamila.souck...@ungleich.ch
Subject: Re: [ceph-
Looks like there is a tracker opened for this.
http://tracker.ceph.com/issues/21197
Please add your details there.
On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly wrote:
> Hi,
>
> I recently upgraded one of our clusters from Kraken to Luminous (the cluster
> was initialized with Jewel) on Ubuntu
Hi,
I recently upgraded one of our clusters from Kraken to Luminous (the cluster
was initialized with Jewel) on Ubuntu 16.04 and deployed ceph-mgr on all of our
ceph-mon nodes with ceph-deploy.
Related log entries after initial deployment of ceph-mgr:
2017-09-11 06:41:53.535025 7fb5aa7b8500 0
(Apologies if this is a double post - I think my phone turned it into
HTML and so it bounced from ceph-devel)...
We currently use both upstream and distro (RHCS) versions on different
clusters. Downstream releases are still free to apply their own
models.
I like the idea of a predictable (and more r
On 7 September 2017 at 01:23, Sage Weil wrote:
> * Drop the odd releases, and aim for a ~9 month cadence. This splits the
> difference between the current even/odd pattern we've been doing.
>
> + eliminate the confusing odd releases with dubious value
> + waiting for the next release isn't quite a
Take a look at these which should answer at least some of your questions.
http://ceph.com/community/new-luminous-bluestore/
http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/
On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh
wrote:
> On 08/09/17 11:44, Richard Hesketh wrot
Great to see this issue sorted.
I have to say I am quite surprised anyone would implement the
export/import workaround mentioned here without *first* racing to this
ML or IRC and crying out for help. This is a valuable resource, made
more so by people sharing issues.
Cheers,
On 12 September 2017
On 12 September 2017 at 01:15, Blair Bethwaite
wrote:
> Flow-control may well just mask the real problem. Did your throughput
> improve? Also, does that mean flow-control is on for all ports on the
> switch...? IIUC, then such "global pause" flow-control will mean switchports
> with links to up
We have generally been running the latest non-LTS 'stable' release, since my
cluster is slightly less mission critical than others, and there were
features important to us added in both Infernalis and Kraken. But I really
only care about RGW. If the RGW component could be split out of Ceph into a
pl
On Wed, Sep 6, 2017 at 4:23 PM, Sage Weil wrote:
> Hi everyone,
>
> Traditionally, we have done a major named "stable" release twice a year,
> and every other such release has been an "LTS" release, with fixes
> backported for 1-2 years.
>
> With kraken and luminous we missed our schedule by a lot
For OpenNebula this would be
http://docs.opennebula.org/5.4/deployment/open_cloud_storage_setup/ceph_ds.html
(added opennebula in CC)
Jason Dillaman writes:
> Yes -- the upgrade documentation definitely needs to be updated to add
> a pre-monitor upgrade step to verify your caps before proceed
Yes -- the upgrade documentation definitely needs to be updated to add
a pre-monitor upgrade step to verify your caps before proceeding -- I
will take care of that under this ticket [1]. I believe the OpenStack
documentation has been updated [2], but let me know if you find other
places.
[1] http:
That indeed worked! Thanks a lot!
The remaining question from my side: did we do anything wrong in the
upgrade process and if not, should it be documented somewhere how to
set up the permissions correctly on upgrade?
Or should the documentation on the side of the cloud infrastructure
software be
Since you have already upgraded to Luminous, the fastest and probably
easiest way to fix this is to run "ceph auth caps client.libvirt mon
'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides
simplified RBD caps via named profiles which ensure all the correct
permissions are enabled.
[
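(A minimal sketch of the commands involved, using the client name and pool from this thread - client.libvirt and pool 'one':)
# check the current caps first
ceph auth get client.libvirt
# switch to the simplified Luminous rbd profiles
ceph auth caps client.libvirt mon 'profile rbd' osd 'profile rbd pool=one'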
Hey Jason,
here it is:
[22:42:12] server4:~# ceph auth get client.libvirt
exported keyring for client.libvirt
[client.libvirt]
key = ...
caps mgr = "allow r"
caps mon = "allow r"
caps osd = "allow class-read object_prefix rbd_children, allow rwx
pool=one"
[22:52:
The only error message I see is from dmesg when trying to access the
XFS filesystem (see attached image).
Let me know if you need any more logs - luckily I can spin up this VM in
a broken state as often as you want to :-)
Jason Dillaman writes:
> ... also, do you have any logs from the OS ass
I see the following which is most likely the issue:
2017-09-11 22:26:38.945776 7efd677fe700 -1
librbd::managed_lock::BreakRequest: 0x7efd58020e70 handle_blacklist:
failed to blacklist lock owner: (13) Permission denied
2017-09-11 22:26:38.945795 7efd677fe700 10
librbd::managed_lock::BreakRequest:
Thanks a lot for the great ceph.conf pointer, Mykola!
I found something interesting:
2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter ms_dispatch
0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] v0'0 uv0 ondisk =
-8 ((8) Exec format error)) v8
2017-09-11 22:26:2
From a backporter's perspective, the appealing options are the ones
that reduce the number of stable releases in maintenance at any
particular time.
In the current practice, there are always at least two LTS releases, and
sometimes a non-LTS release as well, that are "live" and supposed to be
... also, do you have any logs from the OS associated with this log file? I
am specifically looking for anything that indicates which sector was
considered corrupt.
On Mon, Sep 11, 2017 at 4:41 PM, Jason Dillaman wrote:
> Thanks -- I'll take a look to see if anything else stands out. That
> "Exec format e
Thanks -- I'll take a look to see if anything else stands out. That
"Exec format error" isn't actually an issue -- but now that I know
about it, we can prevent it from happening in the future [1]
[1] http://tracker.ceph.com/issues/21360
On Mon, Sep 11, 2017 at 4:32 PM, Nico Schottelius
wrote:
>
On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
wrote:
>
> On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
> > wrote:
> >
> >> On a related note, we are very curious why the snapshot id is
> >> incremented when a snapshot is deleted, this create
On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
> wrote:
>
>> On a related note, we are very curious why the snapshot id is
>> incremented when a snapshot is deleted; this creates lots of
>> phantom entries in the deleted snapshots set. Interleaved
>>
Please excuse my brain-fart. We're using 24 disks on the servers in
question. Only after discussing this further with a colleague did we
realize this.
This brings us right down to the minimum spec, which generally isn't a good idea.
Sincerely
-Dave
On 11/09/17 11:38 AM, bulk.sch...@ucalgary.ca
Hi,
how could this happen:
pgs: 197528/1524 objects degraded (12961.155%)
I did some heavy failover tests, but a value higher than 100% looks strange
(ceph version 12.2.0). Recovery is quite slow.
cluster:
health: HEALTH_WARN
3/1524 objects misplaced (0.197%)
Hi Everyone,
I wonder if someone out there has a similar problem to this?
I keep having issues with memory usage. I have 2 OSD servers with 48G
memory and 12 x 2TB OSDs. I seem to have significantly more memory than
the minimum spec, but these two machines with 2TB drives seem to OOM
kill an
Greetings -
I have created several test buckets in radosgw, to test different
expiration durations:
$ s3cmd mb s3://test2d
I set a lifecycle for each of these buckets:
$ s3cmd setlifecycle lifecycle2d.xml s3://test2d --signature-v2
The files look like this:
http://s3.amazonaws.com/doc/
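(In case it helps others reproduce this, a minimal sketch of what lifecycle2d.xml could contain, assuming the '2d' suffix means a two-day expiration; the rule ID and empty prefix are placeholders following the standard S3 lifecycle schema:)
cat > lifecycle2d.xml <<'EOF'
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>expire-2d</ID>
    <Prefix></Prefix>
    <Status>Enabled</Status>
    <Expiration>
      <Days>2</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
EOF
s3cmd setlifecycle lifecycle2d.xml s3://test2d --signature-v2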
I found a couple of OSDs that were seeing medium errors and marked them out
of the cluster. Once all the PGs were moved off those OSDs, all the
buffer overflows went away.
So there must be some kind of bug that is triggered when an OSD is
misbehaving.
Bryan
From: ceph-users on behalf of Bryan
Flow-control may well just mask the real problem. Did your throughput
improve? Also, does that mean flow-control is on for all ports on the
switch...? IIUC, then such "global pause" flow-control will mean
switchports with links to upstream network devices will also be paused if
the switch is attemp
On 2017-09-11 09:31, Nico Schottelius wrote:
>
> Sarunas,
>
> may I ask when this happened?
I was following
http://docs.ceph.com/docs/master/release-notes/#upgrade-from-jewel-or-kraken
I can't tell which step in particular triggered the issue with the VMs.
> And did you move OSDs or mons a
Hey Mykola,
thanks for the hint, I will test this in a few hours when I'm back on a
regular Internet connection!
Best,
Nico
Mykola Golub writes:
> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>>
>> Just tried and there is not much more log in ceph -w (see below) neither
Definitely would love to see some debug-level logs (debug rbd = 20 and
debug objecter = 20) for any VM that experiences this issue. The only
thing I can think of is something to do with sparse object handling
since (1) krbd doesn't perform sparse reads and (2) re-importing the
file would eliminate
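(For anyone gathering those logs, a minimal sketch of the client-side ceph.conf settings; the log file path is just an example:)
[client]
    debug rbd = 20
    debug objecter = 20
    log file = /var/log/ceph/$name.$pid.log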
Sarunas,
may I ask when this happened?
And did you move OSDs or mons after that export/import procedure?
I really wonder what the reason for this behaviour is, and whether we are
likely to experience it again.
Best,
Nico
Sarunas Burdulis writes:
> On 2017-09-10 08:23, Nico Schottelius wrot
Hi,
flow control was active on the NIC but not on the switch.
Enabling flowcontrol in both directions solved the problem:
flowcontrol receive on
flowcontrol send on
Port    Send FlowControl    Receive FlowControl    RxPause    TxPause
        admin    oper        admin    oper
You could try setting it to run with SimpleMessenger instead of
AsyncMessenger -- the default changed across those releases.
I imagine the root of the problem though is that with BlueStore the OSD is
using a lot more memory than it used to and so we're overflowing the 32-bit
address space...which m
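(A minimal sketch of how one might force the old messenger via ceph.conf, assuming the stock option name; untested on this particular setup:)
[global]
    ms type = simple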
On 2017-09-10 08:23, Nico Schottelius wrote:
>
> Good morning,
>
> yesterday we had an unpleasant surprise that I would like to discuss:
>
> Many (not all!) of our VMs were suddenly
> dying (qemu process exiting) and when trying to restart them, inside the
> qemu process we saw i/o errors on the
Have you tried listing your ceph keys with "/usr/bin/ceph config-key ls"?
2017-09-11 15:56 GMT+05:00 M Ranga Swami Reddy :
> ceph-disk --prepare --dmcrypt --> cmd; where does this command store the
> keys for dmcrypt?
>
> The default as per the docs is /etc/ceph/dmcrypt-keys -> but this directory is
> empty.
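(A minimal sketch of how to check, assuming a ceph-disk dmcrypt deployment that stores its keys in the monitor config-key store; the key path shown is illustrative:)
# list stored keys and filter for dm-crypt entries
/usr/bin/ceph config-key ls | grep dm-crypt
# entries typically look like: dm-crypt/osd/<osd-uuid>/luks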
On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>
> Just tried and there is not much more log in ceph -w (see below) neither
> from the qemu process.
>
> [15:52:43] server4:~$ /usr/bin/qemu-system-x86_64 -name one-17031 -S
> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -rea
ceph-disk --prepare --dmcrypt --> cmd; where does this command store the
keys for dmcrypt?
The default as per the docs is /etc/ceph/dmcrypt-keys -> but this directory is empty.
Thanks
Swami
On Sat, Sep 9, 2017 at 4:34 PM, Дробышевский, Владимир wrote:
> AFAIK in case of dm-crypt luks (as default) ceph-dis
On 08/09/17 11:44, Richard Hesketh wrote:
> Hi,
>
> Reading the ceph-users list I'm obviously seeing a lot of people talking
> about using bluestore now that Luminous has been released. I note that many
> users seem to be under the impression that they need separate block devices
> for the blue
ZhengYan,
It is OK. Thank you very much. Can directory fragmentation be used in
production? Can you give me some advice about CephFS?
Thanks a lot.
donglifec...@gmail.com
From: Yan, Zheng
Date: 2017-09-11 16:00
To: donglifec...@gmail.com
CC: ceph-users
Subject: Re: [ceph-users]cephfs
Hello, Alexandre!
Do you have any testing methodology to share? I have a fresh test
Luminous 12.2.0 cluster with 4 nodes, each with 1 x 1.92TB Samsung SM863 +
InfiniBand, in an unsupported setup (with a co-located system/mon/OSD
partition and BlueStore partition on the same drive, created with modifie
Good morning Lionel,
it's great to hear that it's not only us being affected!
I am not sure what you refer to by "glance" images, but what we see is
that we can spawn a new VM based on an existing image and that one runs.
Can I invite you (and anyone else who has problems w/ Luminous upgrade)
t
Hi,
recently I upgraded a test cluster from 10.2.9 to 12.2.0. When that was
done, I converted all OSDs from filestore to bluestore. Today, ceph
reported a scrub error in the cephfs metadata pool:
ceph health detail
HEALTH_ERR 6 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_
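(In case it's useful, a minimal sketch of the usual next steps for inconsistent PGs; the pool name and PG id below are placeholders:)
# find the inconsistent PGs and inspect the affected objects
rados list-inconsistent-pg cephfs_metadata
rados list-inconsistent-obj <pgid> --format=json-pretty
# then, if appropriate, ask the primary OSD to repair the PG
ceph pg repair <pgid>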
I don't think it's a Mellanox problem.
Output drops at the switch ports are also seen when using an Intel 10 Gbit/s NIC.
On 08.09.2017 17:41, Alexandre DERUMIER wrote:
> Sorry, I didn't see that you use Proxmox 5.
>
> As I'm a Proxmox contributor, I can tell you that I have errors with kernel
> 4.10 (w
Hi,
On 08.09.2017 16:25, Burkhard Linke wrote:
>>> Regarding the drops (and without any experience with neither 25GBit ethernet
>>> nor the Arista switches):
>>> Do you have corresponding input drops on the server's network ports?
>> No input drops, just output drop
> Output drops on the switch ar
Hi,
We also have the same issue with OpenStack instances (QEMU/libvirt) after
upgrading from Kraken to Luminous, just after starting OSD migration from
btrfs to bluestore.
We were able to restart the failed VMs by mounting all disks from a Linux box with
rbd map and running fsck on them.
QEMU host
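(A minimal sketch of that recovery procedure, with placeholder pool/image names:)
# map the VM disk from a separate Linux box and repair its filesystem
rbd map one/vm-disk-0
fsck -y /dev/rbd0
rbd unmap /dev/rbd0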
Looks like http://tracker.ceph.com/issues/18314.
Please try:
run "ceph fs set cephfs1 allow_dirfrags 1"
and
set the mds config option mds_bal_frag to 1, set mds_bal_split_size to 5000, and set
mds_bal_fragment_size_max 5.
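(A minimal sketch of one way to apply those settings at runtime, using the option names and values given above:)
ceph fs set cephfs1 allow_dirfrags 1
ceph tell mds.* injectargs '--mds_bal_frag=true --mds_bal_split_size=5000 --mds_bal_fragment_size_max=5'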
> On 11 Sep 2017, at 15:46, donglifec...@gmail.com wrote:
>
> ZhengYan,
>
ZhengYan,
kernel client is 4.12.0.
[root@yj43959-ceph-dev ~]# uname -a
Linux yj43959-ceph-dev.novalocal 4.12.0-1.el7.elrepo.x86_64 #1 SMP Sun Jul 2
20:38:48 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Thanks a lot.
donglifec...@gmail.com
From: Yan, Zheng
Date: 2017-09-11 15:24
To: donglifec...
> On 11 Sep 2017, at 14:07, donglifec...@gmail.com wrote:
>
> ZhengYan,
>
> I set "mds_bal_fragment_size_max = 10, mds_bal_frag = true", then I
> write 10 files named 512k.file$i, but there are still some file is
> missing. such as :
> [root@yj43959-ceph-dev cephfs]# find ./volumes/