Re: [ceph-users] Calamari server not working after upgrade 0.87-1 -> 0.94-1

2015-04-27 Thread Steffen W Sørensen
> On 27/04/2015, at 15.51, Alexandre DERUMIER wrote: > > Hi, can you check /var/log/salt/minion on your ceph node? > > I have had a similar problem; I needed to remove > > rm /etc/salt/pki/minion/minion_master.pub > /etc/init.d/salt-minion restart > > (I don't know if "calamari-ctl

[ceph-users] [cephfs][ceph-fuse] cache size or memory leak?

2015-04-27 Thread Dexter Xiong
Hi, I've deployed a small Hammer 0.94.1 cluster and I mount it via ceph-fuse on Ubuntu 14.04. After several hours I found that the ceph-fuse process crashed. At the end is the crash log from /var/log/ceph/ceph-client.admin.log. The memory use of the ceph-fuse process was huge (more than 4 GB) when it
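For anyone seeing similar growth, the ceph-fuse caches can be capped from the [client] section of ceph.conf; a minimal sketch, assuming the client_cache_size (cached inodes) and client_oc_size (object-cacher bytes) options available in hammer — values are illustrative, not tuned recommendations:

    # /etc/ceph/ceph.conf on the client node
    [client]
        client cache size = 8192          # max cached inodes (default 16384)
        client oc size = 104857600        # object-cacher size in bytes, here 100 MB

Remount with ceph-fuse afterwards for the new limits to take effect.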

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
To add some more interesting behavior to my problem: the monitors are not updating the status of OSD's. Even when I stop all the remaining OSD's, ceph osd tree shows them as up. Also, the status of mons and mds doesn't seem to update correctly, in my opinion. Below is a copy of statu

[ceph-users] about rgw region and zone

2015-04-27 Thread TERRY
Hi all, when I was configuring federated gateways, I got the error below: sudo radosgw-agent -c /etc/ceph/ceph-data-sync.conf ERROR:root:Could not retrieve region map from destination Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/radosgw_agent/cli.py", lin

Re: [ceph-users] IOWait on SATA-backed with SSD-journals

2015-04-27 Thread Gregory Farnum
On Sat, Apr 25, 2015 at 11:36 PM, Josef Johansson wrote: > Hi, > > With inspiration from all the other performance threads going on here, I > started to investigate on my own as well. > > I’m seeing a lot of iowait on the OSD, and the journal utilised at 2-7%, with > about 8-30MB/s (mostly around 8

[ceph-users] v0.87.2 released

2015-04-27 Thread Sage Weil
This is the second (and possibly final) point release for Giant. We recommend all v0.87.x Giant users upgrade to this release. Notable Changes --- * ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman) * common: do not implicitly unlock r
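For reference, a minimal rolling-upgrade sketch, assuming an Ubuntu/upstart deployment (sysvinit installs would use 'service ceph restart ...' instead); upgrade and restart monitors before OSDs, one node at a time:

    apt-get update && apt-get install -y ceph    # pulls 0.87.2 from the configured Giant repo
    restart ceph-mon-all                         # on monitor nodes first
    restart ceph-osd-all                         # then on OSD nodes
    ceph -s                                      # wait for HEALTH_OK before moving to the next node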

Re: [ceph-users] Ceph Radosgw multi zone data replication failure

2015-04-27 Thread Vickey Singh
Hello Alfredo / Craig, First of all, thank you so much for replying and giving your precious time to this problem. @Alfredo: I tried radosgw-agent version 1.2.2 and the case has progressed a lot (below are some of the logs). I am now getting *2015-04-28 00:35:14,781 5132 [radosgw_agent][

Re: [ceph-users] Shadow Files

2015-04-27 Thread Ben
How long are you thinking here? We added more storage to our cluster to overcome these issues, and we can't keep throwing storage at it until the issues are fixed. On 28/04/15 01:49, Yehuda Sadeh-Weinraub wrote: It will get to the ceph mainline eventually. We're still reviewing and testing t

Re: [ceph-users] Ceph Radosgw multi zone data replication failure

2015-04-27 Thread Alfredo Deza
Hi Vickey (and all) It looks like this issue was introduced as part of the 1.2.1 release. I just finished getting 1.2.2 out (try upgrading please). You should no longer see that error. Hope that helps! -Alfredo - Original Message - From: "Craig Lewis" To: "Vickey Singh" Cc: ceph-use

Re: [ceph-users] Ceph Radosgw multi zone data replication failure

2015-04-27 Thread Craig Lewis
> [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-east-1 > [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-west-1 Are you trying to setup two zones on one cluster? That's possible, but you'll also want to spend some time on your CRUSH map making sure that the two zones are as ind
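One way to keep the zones independent is to give each zone its own CRUSH root and rule, then point each zone's pools at its rule; a rough sketch with hypothetical bucket and pool names (the us-west rule would mirror this one):

    # decompiled crushmap fragment
    rule us-east {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take us-east                  # a root bucket containing only us-east hosts
            step chooseleaf firstn 0 type host
            step emit
    }

    # then, for each pool backing the us-east zone (pool name is a placeholder):
    ceph osd pool set .us-east.rgw.buckets crush_ruleset 1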

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
Hi Updated the logfile, same place http://beta.xaasbox.com/ceph/ceph-osd.15.log Br, Tuomas -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: 27. huhtikuuta 2015 22:22 To: Tuomas Juntunen Cc: ceph-users@lists.ceph.com Subject: RE: [ceph-users] Upgrade from Giant to Hamm

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Sage Weil
On Mon, 27 Apr 2015, Tuomas Juntunen wrote: > Hey > > Got the log, you can get it from > http://beta.xaasbox.com/ceph/ceph-osd.15.log Can you repeat this with 'debug osd = 20'? Thanks! sage > > Br, > Tuomas > > > -Original Message- > From: Sage Weil [mailto:sw...@redhat.com] > Sen
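A sketch of how that debug level can be applied to the affected daemon (osd.15 here, matching the log file above); injectargs only works while the daemon is running, otherwise set it in ceph.conf and restart:

    ceph tell osd.15 injectargs '--debug-osd 20'
    # or persistently, in ceph.conf before restarting the OSD:
    # [osd]
    #     debug osd = 20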

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
Hey Got the log, you can get it from http://beta.xaasbox.com/ceph/ceph-osd.15.log Br, Tuomas -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: 27. huhtikuuta 2015 20:45 To: Tuomas Juntunen Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Upgrade from Giant to H

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Somnath Roy
Yes, the tcmalloc patch we applied is not meant to solve the trace we are seeing. The env variable code path was a no-op in the tcmalloc code base, and the patch has resolved that. Now, setting the env variable takes effect within the tcmalloc code base. This thread cache env variable is a performan
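For anyone wanting to experiment with it, the variable in question is gperftools' TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES; a sketch of setting it in the environment the OSDs are started from (the 128 MB value is only an example, per the discussion in this thread):

    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MB
    # restart the OSDs from this environment so the new thread cache size takes effect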

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
Hi, Thank you so much. Here's the other JSON file; I'll check and install that and get the logs asap too. There have not been any snaps on rbd, I haven't used it at all, it has been just an empty pool. Br, Tuomas -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: 27. huh

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Sage Weil
Yeah, no snaps: images: "snap_mode": "selfmanaged", "snap_seq": 0, "snap_epoch": 17882, "pool_snaps": [], "removed_snaps": "[]", img: "snap_mode": "selfmanaged", "snap_seq": 0, "snap_epoch": 0,

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Mark Nelson
Hi Somnath, Forgive me as I think this was discussed earlier in the thread, but did we confirm that the patch/fix/etc does not 100% fix the problem? Mark On 04/27/2015 12:25 PM, Somnath Roy wrote: Alexandre, The moment you restarted after hitting the tcmalloc trace, irrespective of what val

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Somnath Roy
Alexandre, The moment you restart after hitting the tcmalloc trace, irrespective of what value you set as the thread cache, it will perform better, and that's what is happening in your case, I guess. Yes, setting this value is kind of tricky and very much dependent on your setup/workload etc. I would sugg

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Alexandre DERUMIER
Ok, just to make sure that I understand: >>tcmalloc un-tuned: ~50k IOPS once bug sets in yes, it's really random, but when hitting the bug, yes this is the worst I have seen. >>tcmalloc with patch and 128MB thread cache bytes: ~195k IOPS yes >>jemalloc un-tuned: ~150k IOPS It's more around 185

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
Hi Here you go Br, Tuomas -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: 27. huhtikuuta 2015 19:23 To: Tuomas Juntunen Cc: 'Samuel Just'; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OS

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Mark Nelson
On 04/27/2015 10:11 AM, Alexandre DERUMIER wrote: Is it possible that you were suffering from the bug during the first test but once reinstalled you hadn't hit it yet? yes, I'm pretty sure I've been hitting the tcmalloc bug since the beginning. I had patched it, but I think it's not enough. I had a

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Sage Weil
On Mon, 27 Apr 2015, Tuomas Juntunen wrote: > Thanks for the info. > > To my knowledge there were no snapshots on that pool, but I cannot verify > that. Can you attach a 'ceph osd dump -f json-pretty'? That will shed a bit more light on what happened (and the simplest way to fix it). sage >

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Dan van der Ster
Hi Sage, Alexandre et al. Here's another data point... we noticed something similar a while ago. After we restart our OSDs the "4kB object write latency" [1] temporarily drops from ~8-10ms down to around 3-4ms. Then slowly over time the latency increases back to 8-10ms. The time that the OSDs stay

Re: [ceph-users] Shadow Files

2015-04-27 Thread Yehuda Sadeh-Weinraub
It will get to the ceph mainline eventually. We're still reviewing and testing the fix, and there's more work to be done on the cleanup tool. Yehuda - Original Message - > From: "Ben" > To: "Yehuda Sadeh-Weinraub" > Cc: "ceph-users" > Sent: Sunday, April 26, 2015 11:02:23 PM > Subject

Re: [ceph-users] CephFs - Ceph-fuse Client Read Performance During Cache Tier Flushing

2015-04-27 Thread Mohamed Pakkeer
Hi all, The issue is resolved after upgrading Ceph from Giant to Hammer (0.94.1). Cheers, K. Mohamed Pakkeer On Sun, Apr 26, 2015 at 11:28 AM, Mohamed Pakkeer wrote: > Hi > > I was doing some testing on an erasure-coded CephFS cluster. The cluster > is running the Giant 0.87.1 release. > > > > Clu

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Sage Weil
On Mon, 27 Apr 2015, Alexandre DERUMIER wrote: > >>If I want to use librados API for performance testing, are there any > >>existing benchmark tools which directly access librados (not through > >>rbd or gateway) > > you can use "rados bench" from ceph packages > > http://ceph.com/docs/maste
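A short rados bench sketch for completeness, run against a throwaway pool (pool name and thread count are placeholders):

    rados bench -p testpool 60 write -t 16 --no-cleanup   # 60 s of 4 MB object writes, 16 in flight
    rados bench -p testpool 60 seq -t 16                  # sequential reads of the objects just written
    rados bench -p testpool 60 rand -t 16                 # random reads
    rados -p testpool cleanup                             # remove the benchmark objects afterwards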

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Alexandre DERUMIER
>>Is it possible that you were suffering from the bug during the first >>test but once reinstalled you hadn't hit it yet? yes, I'm pretty sure I've been hitting the tcmalloc bug since the beginning. I had patched it, but I think it's not enough. I have always hit this bug at random, but mainly when I have

Re: [ceph-users] Ceph Radosgw multi zone data replication failure

2015-04-27 Thread Vickey Singh
Hello Cephers, Still waiting for your help. I tried several things but no luck. On Mon, Apr 27, 2015 at 9:07 AM, Vickey Singh wrote: > Any help related to this problem would be highly appreciated. > > -VS- > > > On Sun, Apr 26, 2015 at 6:01 PM, Vickey Singh > wrote: > >> Hello Geeks >> >>

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-27 Thread Mark Nelson
Hi Alex, Is it possible that you were suffering from the bug during the first test but once reinstalled you hadn't hit it yet? That's a pretty major performance swing. I'm not sure if we can draw any conclusions about jemalloc vs tcmalloc until we can figure out what went wrong. Mark On 0

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
I can sacrifice the images and img pools, if that is necessary. Just need to get the thing going again Tuomas -Original Message- From: Samuel Just [mailto:sj...@redhat.com] Sent: 27. huhtikuuta 2015 15:50 To: tuomas juntunen Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Upgra

Re: [ceph-users] Calamari server not working after upgrade 0.87-1 -> 0.94-1

2015-04-27 Thread Alexandre DERUMIER
Hi, can you check /var/log/salt/minion on your ceph node? I have had a similar problem; I needed to remove rm /etc/salt/pki/minion/minion_master.pub /etc/init.d/salt-minion restart (I don't know if "calamari-ctl clear" changes the salt master key) - Original Message - From: "Stef
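Restating Alexandre's steps as commands to run on each ceph node reporting into Calamari:

    rm /etc/salt/pki/minion/minion_master.pub   # drop the cached (possibly stale) master key
    /etc/init.d/salt-minion restart             # the minion re-fetches the key from the salt master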

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Tuomas Juntunen
Thanks for the info. To my knowledge there were no snapshots on that pool, but I cannot verify that. Any way to make this work again? Removing the tier and the other settings didn't fix it; I tried that the second this happened. Br, Tuomas -Original Message- From: Samuel Just [mailto:sj...@red

[ceph-users] radosgw default.conf

2015-04-27 Thread alistair.whittle
I have been trying to configure my radosgw-agent on RHEL 6.5, but after a recent reboot of the gateway node I discovered that the file needed by the /etc/init.d/radosgw-agent script has disappeared (/etc/ceph/radosgw-agent/default.conf). As a result, I can no longer start up the radosgw. H
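If the file is simply gone, it can be recreated by hand; a minimal sketch of the YAML layout radosgw-agent expects, with placeholder zone names, endpoints and keys (check your zone configuration for the real values):

    # /etc/ceph/radosgw-agent/default.conf
    src_zone: us-east
    source: http://us-east.example.com:80
    src_access_key: SRC_ACCESS_KEY
    src_secret_key: SRC_SECRET_KEY
    dest_zone: us-west
    destination: http://us-west.example.com:80
    dest_access_key: DEST_ACCESS_KEY
    dest_secret_key: DEST_SECRET_KEY
    log_file: /var/log/radosgw/radosgw-sync.log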

Re: [ceph-users] very different performance on two volumes in the same pool

2015-04-27 Thread Irek Fasikhov
Hi, Nikola. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg19152.html 2015-04-27 14:17 GMT+03:00 Nikola Ciprich : > Hello Somnath, > > Thanks for the perf data. It seems innocuous. I am not seeing a single > tcmalloc trace; are you running with tcmalloc, by the way? > > according to ldd

[ceph-users] Calamari server not working after upgrade 0.87-1 -> 0.94-1

2015-04-27 Thread Steffen W Sørensen
All, After successfully upgrading from Giant to Hammer, at first our Calamari server seemed fine, showing the new "too many PGs" warning. Then, during/after removing/consolidating various pools, it failed to get updated. Not having been able to find any root cause, I decided to flush the Postgres DB (calamari-ctl cle

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Samuel Just
So, the base tier is what determines the snapshots for the cache/base pool amalgam. You added a populated pool complete with snapshots on top of a base tier without snapshots. Apparently, it caused an existential crisis for the snapshot code. That's one of the reasons why there is a --force-n

Re: [ceph-users] cephfs: recovering from transport endpoint not connected?

2015-04-27 Thread Yan, Zheng
On Mon, Apr 27, 2015 at 3:42 PM, Burkhard Linke wrote: > Hi, > > I've deployed ceph on a number of nodes in our compute cluster (Ubuntu 14.04, > Ceph Firefly 0.80.9). /ceph is mounted via ceph-fuse. > > From time to time some nodes lose their access to cephfs with the following > error message: >

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread tuomas . juntunen
The following: ceph osd tier add img images --force-nonempty ceph osd tier cache-mode images forward ceph osd tier set-overlay img images The idea was to make images a tier of img, move the data to img, then change clients to use the new img pool. Br, Tuomas > Can you explain exactly what you mean
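For reference, backing that configuration out is roughly the inverse (remove the overlay, then detach the tier), though as Tuomas notes elsewhere in the thread, doing so did not by itself bring the OSDs back:

    ceph osd tier remove-overlay img      # stop redirecting client I/O for img through images
    ceph osd tier remove img images       # detach images as a tier of img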

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Samuel Just
Can you explain exactly what you mean by: "Also I created one pool for tier to be able to move data without outage." -Sam - Original Message - From: "tuomas juntunen" To: "Ian Colle" Cc: ceph-users@lists.ceph.com Sent: Monday, April 27, 2015 4:23:44 AM Subject: Re: [ceph-users] Upgrade

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread tuomas . juntunen
Hi Any solution for this yet? Br, Tuomas > It looks like you may have hit http://tracker.ceph.com/issues/7915 > > Ian R. Colle > Global Director > of Software Engineering > Red Hat (Inktank is now part of Red Hat!) > http://www.linkedin.com/in/ircolle > http://www.twitter.com/ircolle > Cell: +1.

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread Ian Colle
It looks like you may have hit http://tracker.ceph.com/issues/7915 Ian R. Colle Global Director of Software Engineering Red Hat (Inktank is now part of Red Hat!) http://www.linkedin.com/in/ircolle http://www.twitter.com/ircolle Cell: +1.303.601.7713 Email: ico...@redhat.com - Original Messag

Re: [ceph-users] very different performance on two volumes in the same pool

2015-04-27 Thread Nikola Ciprich
Hello Somnath, > Thanks for the perf data. It seems innocuous. I am not seeing a single tcmalloc > trace; are you running with tcmalloc, by the way? according to ldd, it seems I have it compiled in, yes: [root@vfnphav1a ~]# ldd /usr/bin/ceph-osd . . libtcmalloc.so.4 => /usr/lib64/libtcmalloc.so.4 (

Re: [ceph-users] cluster not coming up after reboot

2015-04-27 Thread Kenneth Waegeman
On 04/23/2015 06:58 PM, Craig Lewis wrote: Yes, unless you've adjusted: [global] mon osd min down reporters = 9 mon osd min down reports = 12 OSDs talk to the MONs on the public network. The cluster network is only used for OSD to OSD communication. If one OSD node can't talk on that n
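For completeness, those values can also be changed on a running cluster with injectargs (and persisted in the [global] section of ceph.conf for the next restart); the numbers below are just Craig's example values:

    ceph tell mon.* injectargs '--mon-osd-min-down-reporters 9 --mon-osd-min-down-reports 12'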

Re: [ceph-users] Ceph recovery network?

2015-04-27 Thread Sebastien Han
Well yes “pretty much” the same thing :). I think some people would like to distinguish recovery from replication and maybe perform some QoS around these 2. We have to replicate while recovering so one can impact the other. In the end, I just think it’s a doc issue, still waiting for a dev to ans

[ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-04-27 Thread tuomas . juntunen
I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer, then created new pools and deleted some old ones. Also, I created one pool as a tier to be able to move data without an outage. After these operations all but 10 OSD's are down and writing this kind of message to the logs; I get more than 100 GB of these

[ceph-users] cephfs: recovering from transport endpoint not connected?

2015-04-27 Thread Burkhard Linke
Hi, I've deployed ceph on a number of nodes in our compute cluster (Ubuntu 14.04, Ceph Firefly 0.80.9). /ceph is mounted via ceph-fuse. From time to time some nodes lose their access to cephfs with the following error message: # ls /ceph ls: cannot access /ceph: Transport endpoint is not co
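A sketch of one way to recover the mount point without rebooting, assuming a lazy unmount is enough (the monitor address is a placeholder):

    fusermount -uz /ceph                        # lazy-unmount the dead FUSE mount
    ceph-fuse -m mon1.example.com:6789 /ceph    # remount via ceph-fuse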