[ceph-users] Disabling btrfs snapshots for existing OSDs

2015-04-23 Thread Burkhard Linke

Hi,

I have a small number of OSDs running Ubuntu Trusty 14.04 and Ceph 
Firefly 0.80.9. Due to stability issues I would like to disable the 
btrfs snapshot feature (filestore btrfs snap = false).


Is it possible to apply this change to an existing OSD (stop OSD, change 
config, restart OSD), or do I need to recreate the OSD from scratch?


Best regards,
Burkhard



Re: [ceph-users] Ceph Hammer question..

2015-04-23 Thread Steffen W Sørensen
> I have a cluster currently on Giant - is Hammer stable/ready for production 
> use?
I assume so; I upgraded a 0.87-1 to 0.94-1, and the only thing that came up was that
Ceph now warns if you have too many PGs (>300/OSD), which it turned out I and
others had. So I had to do some pool consolidation in order to get back to an OK health
status; otherwise Hammer is doing fine.
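
(For reference, the threshold behind that warning is the monitor option
mon_pg_warn_max_per_osd, default 300. A rough way to see where a cluster stands;
the commands are standard, the ratio in the comment is only an approximation:)

# total PG copies: sum of pg_num * size over all pools
ceph osd dump | grep pg_num
# number of OSDs those copies are spread over
ceph osd stat
# the warning fires roughly when (total PG copies) / ("in" OSDs) > mon_pg_warn_max_per_osd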

/Steffen


Re: [ceph-users] Disabling btrfs snapshots for existing OSDs

2015-04-23 Thread Christian Balzer

Hello,

On Thu, 23 Apr 2015 09:10:13 +0200 Burkhard Linke wrote:

> Hi,
> 
> I have a small number of OSDs running Ubuntu Trusty 14.04 and Ceph 
> Firefly 0.80.9. Due to stability issues I would like to disable the 
> btrfs snapshot feature (filestore btrfs snap = false).
> 
> Is it possible to apply this change to an existing OSD (stop OSD, change 
> config, restart OSD), or do I need to recreate the OSD from scratch?
> 
While I don't know if you can change this mid-race so to speak (but I
would assume yes, as it should affect only new snapshots), what I do know
is that in all likelihood you won't need to stop the OSD to apply the
change.
As in, use the admin socket interface to inject the new setting into the
respective OSD. 
Keeping ceph.conf up to date (if only for reference) is of course helpful,
too.
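
(A minimal sketch of that injection; the OSD id is only a placeholder:)

# on the OSD's host, via the admin socket:
ceph daemon osd.12 config set filestore_btrfs_snap false
# or remotely through the monitors:
ceph tell osd.12 injectargs '--filestore-btrfs-snap=false'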

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] unbalanced OSDs

2015-04-23 Thread Stefan Priebe - Profihost AG

Am 22.04.2015 um 19:31 schrieb J David:
> On Wed, Apr 22, 2015 at 7:12 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Also a reweight-by-utilization does nothing.
> 
> As a fellow sufferer from this issue, mostly what I can offer you is
> sympathy rather than actual help.  However, this may be beneficial:
> 
> By default, reweight-by-utilization only alters OSD's that are 20%
> above average.  This is really too conservative in our case,
> especially for smaller OSD's.  It also isn't helpful if the problem
> isn't a couple of OSD's way above average, but rather some OSD's way
> below.
> 
> Try:
> 
> # ceph osd reweight-by-utilization 110

Thanks that worked fine.

> or possibly even:
> 
> # ceph osd reweight-by-utilization 105
> 
> This should give more helpful results.
> 
> To the extent that you still have problems after running that, like if
> running it consistently fixes osd.1 but pushes utilizations of osd.2
> up too high and leaves osd.3 mostly empty, then you may have to start
> assigning reweights by hand.
> 
> Also, you didn't mention it explicitly, so if this cluster predates
> 0.80.9 at all you may need to set:
> 
> ceph osd crush set-tunable straw_calc_version 1

I had already done this.
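
(For reference, manual reweighting as suggested above looks roughly like this;
the id and weight are only examples:)

# temporarily lower the data share of an over-full OSD (reweight is 0.0-1.0)
ceph osd reweight 7 0.85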

Stefan


Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-23 Thread Steffen W Sørensen
> But in the menu, the use case "cephfs only" doesn't exist and I have
> no idea of the %data for each pools metadata and data. So, what is
> the proportion (approximatively) of %data between the "data" pool and
> the "metadata" pool of cephfs in a cephfs-only cluster?
> 
> Is it rather metadata=20%, data=80%?
> Is it rather metadata=10%, data=90%?
> Is it rather metadata= 5%, data=95%?
> etc.
Mileage will vary here, depending on the ratio between the number of entries in
your Ceph FS and their sizes, e.g. many small files vs. few large ones.
So you are probably the best one to estimate this yourself :)

/Steffen


[ceph-users] One more thing. Journal or not to journal or DB-what? Status?

2015-04-23 Thread Götz Reinicke - IT Koordinator
Dear folks,

I'm sorry for the strange subject, but that might show my current
confusion too.

From what I know, writes to an OSD are also journaled for speed and
consistency. Currently that is done to the/a filesystem; that's why a
lot of the suggestions are to use SSDs for journals.
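
(For illustration, the journal is simply pointed at a separate device; a minimal
ceph.conf sketch, with the OSD id and partition path purely as examples:)

[osd]
    # journal size in MB
    osd journal size = 10240
[osd.12]
    # example: journal on a dedicated SSD partition
    osd journal = /dev/disk/by-partlabel/journal-osd12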

So far, that's clear.

But I don't understand what leveldb/rocksdb/LMDB have to do with it.

From what I've read and understood, those DBs could make the SSD journal
obsolete. Keyword: FileStore/KeyValueStore.

True? Wrong? :)

Maybe someone can explain this to me? And maybe there is a roadmap on
the progress?

We hope to reduce the system's complexity (dedicated journal SSDs) with that.

http://tracker.ceph.com/issues/11028 says "LMDB key/value backend for
Ceph" was 70% done 15 days ago.


Kowtow, kowtow and thanks . Götz

-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82 420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt





Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-23 Thread Alexandre DERUMIER
Hi,
I'm hitting this bug again today.

So it doesn't seem to be NUMA related (I have tried flushing the Linux buffers to be sure).

And tcmalloc is patched (I don't know how to verify that it's OK).

I haven't restarted the OSDs yet.

Maybe some perf traces could be useful?


- Mail original -
De: "aderumier" 
À: "Srinivasula Maram" 
Cc: "ceph-users" , "ceph-devel" 
, "Milosz Tanski" 
Envoyé: Mercredi 22 Avril 2015 18:30:26
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Hi, 

>>I feel it is due to tcmalloc issue 

Indeed, I had patched one of my node, but not the other. 
So maybe I have hit this bug. (but I can't confirm, I don't have traces). 

But numa interleaving seem to help in my case (maybe not from 100->300k, but 
250k->300k). 

I need to do more long tests to confirm that. 


- Mail original - 
De: "Srinivasula Maram"  
À: "Mark Nelson" , "aderumier" , 
"Milosz Tanski"  
Cc: "ceph-devel" , "ceph-users" 
 
Envoyé: Mercredi 22 Avril 2015 16:34:33 
Objet: RE: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops 

I feel it is due to tcmalloc issue 

I have seen similar issue in my setup after 20 days. 

Thanks, 
Srinivas 



-Original Message- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson 
Sent: Wednesday, April 22, 2015 7:31 PM 
To: Alexandre DERUMIER; Milosz Tanski 
Cc: ceph-devel; ceph-users 
Subject: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops 

Hi Alexandre, 

We should discuss this at the perf meeting today. We knew NUMA node affinity 
issues were going to crop up sooner or later (and indeed already have in some 
cases), but this is pretty major. It's probably time to really dig in and 
figure out how to deal with this. 

Note: this is one of the reasons I like small nodes with single sockets and 
fewer OSDs. 

Mark 

On 04/22/2015 08:56 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I have done a lot of test today, and it seem indeed numa related. 
> 
> My numastat was 
> 
> # numastat
>                    node0        node1
> numa_hit        99075422    153976877
> numa_miss      167490965      1493663
> numa_foreign     1493663    167491417
> interleave_hit    157745       167015
> local_node      99049179    153830554
> other_node     167517697      1639986
> 
> So, a lot of miss. 
> 
> In this case , I can reproduce ios going from 85k to 300k iops, up and down. 
> 
> now setting 
> echo 0 > /proc/sys/kernel/numa_balancing 
> 
> and starting osd daemons with 
> 
> numactl --interleave=all /usr/bin/ceph-osd 
> 
> 
> I have a constant 300k iops ! 
> 
> 
> I wonder if it could be improve by binding osd daemons to specific numa node. 
> I have 2 numanode of 10 cores with 6 osd, but I think it also require 
> ceph.conf osd threads tunning. 
> 
> 
> 
> - Mail original - 
> De: "Milosz Tanski"  
> À: "aderumier"  
> Cc: "ceph-devel" , "ceph-users" 
>  
> Envoyé: Mercredi 22 Avril 2015 12:54:23 
> Objet: Re: [ceph-users] strange benchmark problem : restarting osd 
> daemon improve performance from 100k iops to 300k iops 
> 
> 
> 
> On Wed, Apr 22, 2015 at 5:01 AM, Alexandre DERUMIER < aderum...@odiso.com > 
> wrote: 
> 
> 
> I wonder if it could be numa related, 
> 
> I'm using centos 7.1, 
> and auto numa balacning is enabled 
> 
> cat /proc/sys/kernel/numa_balancing = 1 
> 
> Maybe osd daemon access to buffer on wrong numa node. 
> 
> I'll try to reproduce the problem 
> 
> 
> 
> Can you force the degenerate case using numactl? To either affirm or deny 
> your suspicion. 
> 
> 
> 
> 
> - Mail original - 
> De: "aderumier" < aderum...@odiso.com > 
> À: "ceph-devel" < ceph-de...@vger.kernel.org >, "ceph-users" < 
> ceph-users@lists.ceph.com > 
> Envoyé: Mercredi 22 Avril 2015 10:40:05 
> Objet: [ceph-users] strange benchmark problem : restarting osd daemon 
> improve performance from 100k iops to 300k iops 
> 
> Hi, 
> 
> I was doing some benchmarks, 
> I have found an strange behaviour. 
> 
> Using fio with rbd engine, I was able to reach around 100k iops. 
> (osd datas in linux buffer, iostat show 0% disk access) 
> 
> then after restarting all osd daemons, 
> 
> the same fio benchmark show now around 300k iops. 
> (osd datas in linux buffer, iostat show 0% disk access) 
> 
> 
> any ideas? 
> 
> 
> 
> 
> before restarting osd 
> - 
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=rbd, iodepth=32 ... 
> fio-2.2.7-10-g51e9 
> Starting 10 processes 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> rbd engine: RBD version: 0.1.9 
> ^Cbs: 10 (f=10): [r(10)] [2.9% done] [376.1MB/0KB/0KB /s] [96.6K/0/0 

[ceph-users] Accidentally Remove OSDs

2015-04-23 Thread FaHui Lin

Dear Ceph experts,

I'm a very new Ceph user. I made a blunder: I removed some OSDs (and 
all files in the related directories) before Ceph finished rebalancing 
data and migrating PGs.


Not to mention the data loss, I am facing the following problems:

1) There are always stale PGs showing in ceph status (with a health 
warning). Take one of the stale PGs, 17.a2, for example:


   # ceph -v
   ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)

   # ceph -s
cluster 3f81b47e-fb15-4fbb-9fee-0b1986dfd7ea
 health HEALTH_WARN 203 pgs degraded; 366 pgs stale; 203 pgs
   stuck degraded; 366 pgs stuck stale; 203 pgs stuck unclean; 203
   pgs stuck undersized; 203 pgs undersized; 154 requests are blocked >
   32 sec; recovery 153738/18991802 objects degraded (0.809%)
 monmap e1: 1 mons at {...=...:6789/0}, election epoch 1,
   quorum 0 tw-ceph01
 osdmap e3697: 12 osds: 12 up, 12 in
  pgmap v21296531: 1156 pgs, 18 pools, 36929 GB data, 9273 kobjects
72068 GB used, 409 TB / 480 TB avail
153738/18991802 objects degraded (0.809%)
 163 stale+active+clean
 786 active+clean
 203 stale+active+undersized+degraded
   4 active+clean+scrubbing+deep


   # ceph pg dump_stuck stale | grep 17.a2
   17.a2   0   0   0   0   0   0   0   0   stale+active+clean
   2015-04-20 09:16:11.624952   0'0   2718:200   [15,17]   15   [15,17]   15
   0'0   2015-04-15 10:42:37.880699   0'0   2015-04-15 10:42:37.880699

   # ceph pg repair 17.a2
   Error EAGAIN: pg 17.a2 primary osd.15 not up

   # ceph pg scrub 17.a2
   Error EAGAIN: pg 17.a2 primary osd.15 not up

   # ceph pg map 17.a2
   osdmap e3695 pg 17.a2 (17.a2) -> up [27,3] acting [27,3]


where osd.15 had already been removed. The PG now seems to map to existing 
OSDs ([27, 3]).
Can this PG eventually get recovered by moving to the existing OSDs? If 
not, what can I do about this kind of stale PG?


2) I tried to solve the problem above by re-creating the OSDs, but failed. 
The reason was that I cannot create an OSD with the same ID as the one I 
removed, say osd.15 (or change the ID of an OSD).
Is there any way to change the ID of an OSD? (By the way, I'm surprised 
that this issue can hardly be found on the internet.)


3) I tried another thing: to dump the crushmap and remove everything 
(including devices and buckets sections) related to the OSDs I removed. 
However, after I set the crushmap and dumped it out again, I found the 
OSDs' lines still appear in the devices section (not in the buckets 
section, though), such as:


   # devices
   device 0 osd.0
   device 2 osd.2
   device 3 osd.3
   device 4 osd.4
   device 5 device5
   ...
   device 14 device14
   device 15 device15


Is there any way to remove them? Does it matter when I want to add new OSDs?

Please inform me if you have any comments. Thank you.

Best Regards,
FaHui



Re: [ceph-users] systemd unit files and multiple daemons

2015-04-23 Thread HEWLETT, Paul (Paul)** CTR **
What about running multiple clusters on the same host?

There is a separate mail thread about being able to run clusters with different 
conf files on the same host.
Will the new systemd service scripts cope with this?

Paul Hewlett
Senior Systems Engineer
Velocix, Cambridge
Alcatel-Lucent
t: +44 1223 435893




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Gregory 
Farnum [g...@gregs42.com]
Sent: 22 April 2015 23:26
To: Ken Dreyer
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] systemd unit files and multiple daemons

On Wed, Apr 22, 2015 at 2:57 PM, Ken Dreyer  wrote:
> I could really use some eyes on the systemd change proposed here:
> http://tracker.ceph.com/issues/11344
>
> Specifically, on bullet #4 there, should we have a single
> "ceph-mon.service" (implying that users should only run one monitor
> daemon per server) or if we should support multiple "ceph-mon@" services
> (implying that users will need to specify additional information when
> starting the service(s)). The version in our tree is "ceph-mon@". James'
> work for Ubuntu Vivid is only "ceph-mon" [2]. Same thing for ceph-mds vs
> ceph-mds@.
>
> I'd prefer to keep Ubuntu downstream the same as Ceph upstream.
>
> What do we want to do for this?
>
> How common is it to run multiple monitor daemons or mds daemons on a
> single host?

For a real deployment, you shouldn't be running multiple monitors on a
single node in the general case. I'm not sure if we want to prohibit
it by policy, but I'd be okay with the idea.
For testing purposes (in ceph-qa-suite or using vstart as a developer)
it's pretty common though, and we probably don't want to have to
rewrite all our tests to change that. I'm not sure that vstart ever
uses the regular init system, but teuthology/ceph-qa-suite obviously
do!

For MDSes, it's probably appropriate/correct to support multiple
daemons on the same host. This can be either a fault tolerance thing,
or just a way of better using multiple cores if you're living on the
(very dangerous) edge.
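
(For readers less familiar with systemd, the practical difference between the two
proposals is how instances are addressed; the instance names below are only
illustrative:)

# templated units: one instance per daemon id, several per host are possible
systemctl start ceph-mon@ceph1.service
systemctl start ceph-mds@a.service
# single units: exactly one monitor / one MDS per host
systemctl start ceph-mon.service
systemctl start ceph-mds.service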
-Greg


[ceph-users] "Compacting" btrfs file storage

2015-04-23 Thread Burkhard Linke

Hi,

I've noticed that the btrfs file storage is performing some 
cleanup/compacting operations during OSD startup.


Before OSD start:
/dev/sdc1  2.8T  2.4T  390G  87% /var/lib/ceph/osd/ceph-58

After OSD start:
/dev/sdc1  2.8T  2.2T  629G  78% /var/lib/ceph/osd/ceph-58

OSDs are configured with firefly default settings.

This "compacting" of the underlying storage happens during the PG 
loading phase of the OSD start.


Is it possible to trigger this compacting without restarting the OSD?

Best regards,
Burkhard


[ceph-users] Ceph Object Gateway in star topology

2015-04-23 Thread Evgeny P. Kurbatov
Hi, everyone!

Consider N nodes that receive and store some objects, and a node N+1
acting as central storage. None of the N nodes can see each other's
objects, but the central node can see everything. We could run
independent Ceph storage on each of the N nodes and replicate objects to
the central storage. However, this design wouldn't have any advantage over
an obvious solution with multi-master replication between relational
DBs like Postgres.

Is there an elegant solution built into Ceph itself? Say, each of the N
nodes stays in its own pool or zone (not sure if I am using these terms
properly) of a single storage cluster, while the (N+1)'th node stays in
all zones simultaneously. If there is such a solution, please point me to
the relevant part of the manuals.

Regards,
Evgeny


Re: [ceph-users] ceph-fuse unable to run through "screen" ?

2015-04-23 Thread Steffen W Sørensen

> On 23/04/2015, at 10.24, Florent B  wrote:
> 
> I come back with this problem because it persists even after upgrading
> to Hammer.
> 
> With CephFS, it does not work, and the only workaround I found does not
> work 100% of time :
I also found issues at reboot, because starting the Ceph fuse daemon will 
possibly create the mount point directory, so I ended up with this:

in /etc/fstab I have:

id=admin  /var/lib/ceph/backup  fuse.ceph defaults 0 0

then I run below script from /etc/rc.local :

#!/bin/sh

# after boot Ceph FS doesn't get mounted, so run this to verify and optionally mount

mounted=`df -h /var/lib/ceph/backup/.| grep -c '^ceph-fuse'`
if [ $mounted -eq 1 ]; then
  echo CephFS is mounted
else
  echo CephFS is not mounted, clearing mountpoint
  cd /var/lib/ceph
  mv backup backup.old
  mkdir backup
  # assume it is in the fstab
  mount /var/lib/ceph/backup
  # a bit dangerous :/ so ONLY on mount success
  [ $? -eq 0 ] && rm -rf backup.old
fi


It seems to work most of the time; otherwise I run the script once by hand :)

/Steffen

> shell: bash -c "mountpoint /var/www/sites/default/files || rm -Rf
> /var/www/sites/default/files/{*,*.*,.*}; screen -d -m -S cephfs-drupal
> mount /var/www/sites/default/files"
> 
> Sometimes it mounts, sometimes not... that's really weird.
> 
> My mount point is configured with daemonize=false, because if I set it
> to true, it never works !
> 
> I really does not understand what the problem is.
> 
> What Ceph-fuse needs to mount correctly 100% of times ??
> 
> Thank you.
> 
> On 03/18/2015 10:42 AM, Florent B wrote:
>> In fact, my problem is not related to Ansible.
>> 
>> For example, make a bash script :
>> 
>> #! /bin/bash
>> mountpoint /mnt/cephfs || mount /mnt/cephfs
>> 
>> 
>> And run it with "screen" : screen mount-me.sh
>> 
>> Directory is not mounted !
>> 
>> What is this ? :D
>> 
>> If you run the script without screen, all works fine !
>> 
>> Is there any kind of particular return system with ceph-fuse ?


[ceph-users] OSD move after reboot

2015-04-23 Thread Jake Grimmett

Dear All,

I have multiple disk types (15k & 7k) on each ceph node, which I assign 
to different pools, but have a problem: whenever I reboot a node, the 
OSDs move in the CRUSH map.


i.e. on host ceph4, before a reboot I have this osd tree

-10  7.68980 root 15k-disk
(snip)
 -9  2.19995 host ceph4-15k
 44  0.54999 osd.44 up  1.0  1.0
 45  0.54999 osd.45 up  1.0  1.0
 46  0.54999 osd.46 up  1.0  1.0
 47  0.54999 osd.47 up  1.0  1.0
(snip)
 -1 34.96852 root 7k-disk
(snip)
 -5  7.36891 host ceph4
 24  0.90999 osd.24 up  1.0  1.0
 25  0.90999 osd.25 up  1.0  1.0
 26  0.90999 osd.26   down0  1.0
 27  0.90999 osd.27 up  1.0  1.0
 28  0.90999 osd.28 up  1.0  1.0
 29  0.90999 osd.29 up  1.0  1.0
 31  0.90999 osd.31 up  1.0  1.0
 30  0.99899 osd.30 up  1.0  1.0


After a reboot I have this:

-10  5.48985 root 15k-disk
 -6  2.19995 host ceph1-15k
 32  0.54999 osd.32 up  1.0  1.0
 33  0.54999 osd.33 up  1.0  1.0
 34  0.54999 osd.34 up  1.0  1.0
 35  0.54999 osd.35 up  1.0  1.0
  -7        0 host ceph2-15k
  -8        0 host ceph3-15k
  -9        0 host ceph4-15k
-1 37.16847 root 7k-disk
(snip)
 -5  9.56886 host ceph4
 24  0.90999 osd.24 up  1.0  1.0
 25  0.90999 osd.25 up  1.0  1.0
 26  0.90999 osd.26   down0  1.0
 27  0.90999 osd.27 up  1.0  1.0
 28  0.90999 osd.28 up  1.0  1.0
 29  0.90999 osd.29 up  1.0  1.0
 31  0.90999 osd.31 up  1.0  1.0
 30  0.99899 osd.30 up  1.0  1.0
 44  0.54999 osd.44 up  1.0  1.0
 46  0.54999 osd.46 up  1.0  1.0
 47  0.54999 osd.47 up  1.0  1.0
 45  0.54999 osd.45 up  1.0  1.0


My current kludge is to just put a series of "osd crush set" lines like 
this in rc.local:


ceph osd crush set osd.44 0.54999 root=15k-disk host=ceph4-15k

but presumably this is not the right solution...

I'm using hammer (0.94.1) on Scientific Linux 6.6

Full details on how I added OSD and edited CRUSH map are here:

http://pastebin.com/R2yaab8m

Many thanks!

Jake


[ceph-users] Another OSD Crush question.

2015-04-23 Thread Rogier Dikkes
Hello all, 

At this moment we have a scenario on which I would like your opinion.

Scenario: 
Currently we have a Ceph environment with 1 rack of hardware; this rack 
contains a couple of OSD nodes with 4T disks. In a few months' time we will 
deploy 2 more racks with OSD nodes; these nodes have 6T disks, and there is one 
more node per rack.

Short overview: 
rack1: 4T OSD
rack2: 6T OSD
rack3: 6T OSD

At this moment we are playing around with the idea of using the CRUSH map to make 
Ceph 'rack aware' and ensure data is replicated between racks. However, 
from the documentation I gathered that when you enforce data replication 
between buckets, your maximum capacity is limited by the smallest bucket. My 
understanding: if we enforce the objects (size=3) to be replicated to 3 racks, then the 
moment the rack with 4T OSDs is full we cannot store data anymore.

Is this assumption correct?

The current idea we play with: 

- Create 2 rack buckets
- Create a ruleset to create 2 object replicas for the 2x 6T buckets
- Create a ruleset to create 1 object replica over all the hosts.

This would result in 3 replicas of the object, where we are sure that at least 2 copies 
are in different racks. In the unlikely event of a rack failure we 
would have at least 1 or 2 replicas left.

Our idea is to have a crush rule with config that looks like: 
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9


  host r01-cn01 {
  id -1
  alg straw
  hash 0
  item osd.0 weight 4.00
  }

  host r01-cn02 {
  id -2
  alg straw
  hash 0
  item osd.1 weight 4.00
  }

  host r01-cn03 {
  id -3
  alg straw
  hash 0
  item osd.3 weight 4.00
  }

  host r02-cn04 {
  id -4
  alg straw
  hash 0
  item osd.4 weight 6.00
  }

  host r02-cn05 {
  id -5
  alg straw
  hash 0
  item osd.5 weight 6.00
  }

  host r02-cn06 {
  id -6
  alg straw
  hash 0
  item osd.6 weight 6.00
  }

  host r03-cn07 {
  id -7
  alg straw
  hash 0
  item osd.7 weight 6.00
  }

  host r03-cn08 {
  id -8
  alg straw
  hash 0
  item osd.8 weight 6.00
  }

  host r03-cn09 {
  id -9
  alg straw
  hash 0
  item osd.9 weight 6.00
  }

  rack r02 {
  id -10
  alg straw
  hash 0
  item r02-cn04 weight 6.00
  item r02-cn05 weight 6.00
  item r02-cn06 weight 6.00
  }  

  rack r03 {
  id -11
  alg straw
  hash 0
  item r03-cn07 weight 6.00
  item r03-cn08 weight 6.00
  item r03-cn09 weight 6.00
  }

  root 6t {
  id -12
  alg straw
  hash 0
  item r02 weight 18.00
  item r03 weight 18.00
  }

  rule one {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take 6t
  step chooseleaf firstn 2 type rack
  step chooseleaf firstn 1 type host
  step emit
  }
Is this the right approach, and would it cause limitations in terms of 
performance or usability? Do you have suggestions?
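
(One way to sanity-check a map like this before injecting it is crushtool's test
mode; a sketch, assuming the decompiled text above is saved as /tmp/crush.txt:)

crushtool -c /tmp/crush.txt -o /tmp/crush.bin
crushtool -i /tmp/crush.bin --test --rule 1 --num-rep 3 --show-statistics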

Another interesting situation we have: we are going to move the hardware 
to new locations next year; the rack layout will change and thus the CRUSH map 
will be altered. When changing the CRUSH map in a way that theoretically turns the 
2x 6T racks into 4 racks, would we need to take any special precautions?

Thank you for your answers, they are much appreciated! 

Rogier Dikkes
System Programmer Hadoop & HPC Cloud
SURFsara | Science Park 140 | 1098 XG Amsterdam



Re: [ceph-users] many slow requests on different osds - STRANGE!

2015-04-23 Thread Ritter Sławomir
=== Facts ===
1. RadosGW disabled, rados bench write -  10 x bigger traffic served without 
any slow_request.
2. RadosGW enabled  - first slow_requests.
3. Traffic via RadosGW – 20-50 slow_requests per minute (~0,1% of IOPS)
4. We compacted leveldb on MONs 48h before first slow_requests.  Maybe there is 
some strange correlation?

=== Details ===
ceph.log
2015-04-22 17:46:33.805926 osd.165 10.176.131.17:6810/98485 3 : [WRN] 1 slow 
requests, 1 included below; oldest blocked for > 36.621691 secs
2015-04-22 17:46:33.805933 osd.165 10.176.131.17:6810/98485 4 : [WRN] slow 
request 36.621691 seconds old, received at 2015-04-22 17:45:57.184143: 
osd_op(client.22009265.0:244576 
default.2139585.95_cache/6f63f82baac53e854c126598c6dc8aaf_800x600_90.jpg 
[create 0~0,setxattr user.rgw.idtag (23),writefull 0~9971,setxattr 
user.rgw.manifest (296),setxattr user.rgw.acl (129),setxattr 
user.rgw.content_type (25),setxattr user.rgw.etag (33),setxattr 
user.rgw.x-amz-meta-md5 (33),setxattr user.rgw.x-amz-meta-sync-priority (5)] 
10.58f70cfe e106833) v4 currently waiting for subops from [260,187]

heartbeat before slow_req:
2015-04-22 17:46:13.059658 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059669 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:13.059677 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059683 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:13.059693 7fcba064e700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059701 7fcba064e700 20 heartbeat_map is_healthy = NOT 
HEALTHY
….
2015-04-22 17:46:33.794865 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.794877 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:33.795912 7fcba064e700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.795923 7fcba064e700 20 heartbeat_map is_healthy = NOT 
HEALTHY

Lossy connections a few seconds before the slow_requests, even on loopbacks:
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471808 7ffcd7837700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.131.4:0/2937379 pipe(0x103ce780 sd=180 
:6829 s=2 pgs=121 cs=1 l=1 c=0x103cbdc0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471828 7ffcd7736700 10 -- 
10.176.131.4:6821/2938116 >> 10.176.131.4:0/2937379 pipe(0x103b5280 sd=742 
:6821 s=2 pgs=122 cs=1 l=1 c=0x103beb00).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229544 7ffcfd28f700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.11:0/3658878 pipe(0x1073f280 sd=735 
:6829 s=2 pgs=1919 cs=1 l=1 c=0xff6dc60).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229670 7ffcf813e700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.29:0/2882509 pipe(0x17f74000 sd=862 
:6829 s=2 pgs=582 cs=1 l=1 c=0x101e96e0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229676 7ffcf4f0c700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.35:0/2697022 pipe(0xff8ba00 sd=190 
:6829 s=2 pgs=13 cs=1 l=1 c=0xff6d420).fault on lossy channel, failing

Network
We checked it twice with iperf, ping, etc. It works OK, and sysctls are also 
tuned. We are observing peaks of TCP RSTs and TCP drops when there are 
slow_reqs, but it seems these are an effect of the problem rather than its cause.

Drives/Machines
We checked our drives (SMART), RAID controllers and machines, and everything looks 
correct. We disabled the caches on the RAID controllers and on the drives, and nothing 
changed. There is no leader in the histogram of slow requests, i.e. no single drive or 
machine with bigger problems than the others.
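
(One more data point that could be gathered: the admin socket of an affected OSD
keeps a trace of its slowest recent operations; the OSD id below is simply the one
from the log excerpt:)

# ops currently blocked, with the step they are stuck at
ceph daemon osd.165 dump_ops_in_flight
# slowest recent ops, with per-step timestamps
ceph daemon osd.165 dump_historic_ops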

Regards,

Slawomir Ritter


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Craig 
Lewis
Sent: Friday, April 17, 2015 7:56 PM
To: Dominik Mostowiec
Cc: Studziński Krzysztof; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] many slow requests on different osds (scrubbing 
disabled)

I've seen something like this a few times.

Once, I lost the battery in my battery backed RAID card.  That caused all the 
OSDs on that host to be slow, which triggered slow request notices pretty much 
cluster wide.  It was only when I histogrammed the slow request notices that I 
saw most of them were on a single node.  I compared the disk latency graphs 
between nodes, and saw that one node had a much higher write latency. This took 
me a while to track down.

Another time, I had a consumer HDD that was slowly failing.  It would hit a 
group of bad sectors, remap, repeat.  SMART warned me about it, so I replaced 
the disk after the second slow request alert.  This was pretty 
straightforward to diagnose, only because smartd notified me.


In both cases, I saw "slow request" notic

Re: [ceph-users] ceph-disk activate hangs with external journal device

2015-04-23 Thread Daniel Piddock
On 22/04/15 20:32, Robert LeBlanc wrote:
> I believe your problem is that you haven't created bootstrap-osd key
> and distributed it to your OSD node in /var/lib/ceph/bootstrap-osd/.

Hi Robert,

Thank you for your reply.

In my original post, steps performed, I did include copying over the
bootstrap-osd key. Also "ceph-disk activate" fails with an obvious error
when that file is missing:

2015-04-23 10:16:47.245951 7fccc5a9c700 -1 monclient(hunting): ERROR:
missing keyring, cannot use cephx for authentication
2015-04-23 10:16:47.245955 7fccc5a9c700  0 librados:
client.bootstrap-osd initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
ERROR:ceph-disk:Failed to activate
ceph-disk: Error: ceph osd create failed: Command '/usr/bin/ceph'
returned non-zero exit status 1:

This is not the source of my issue.

Dan

>
>
> On Wed, Apr 22, 2015 at 5:41 AM, Daniel Piddock
> mailto:dgp-c...@corefiling.co.uk>> wrote:
>
> Hi,
>
> I'm a ceph newbie setting up some trial installs for evaluation.
>
> Using Debian stable (Wheezy) with Ceph Firefly from backports
> (0.80.7-1~bpo70+1).
>
> I've been following the instructions at
> http://docs.ceph.com/docs/firefly/install/manual-deployment/ and first
> time through went well, using a partition on the same drive as the
> OS. I
> then migrated to having data on separate harddrives and that
> worked too.
>
> I'm currently trying to get an OSD set up with the journal on an SSD
> partition that's separate from the data drive. ceph-disk is not
> playing
> ball and I've been getting various forms of failure. My greatest
> success
> was getting the OSD created but it would never go "up". I'm struggling
> to find anything useful in the logs or really what to look for.
>
> I purged the ceph package and wiped the storage drives to give me a
> blank slate and tried again.
>
> Steps performed:
>
> camel (MON server):
> $ apt-get install ceph
> $ uuidgen #= 8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2
> # created /etc/ceph/ceph.conf, attached
> $ ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key
> -n mon. \
>   --cap mon 'allow *'
> $ ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
> --gen-key \
>   -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow
> *' \
>   --cap mds 'allow'
> $ ceph-authtool /tmp/ceph.mon.keyring --import-keyring \
>   /etc/ceph/ceph.client.admin.keyring
> $ monmaptool --create --add a 10.1.0.3 --fsid \
>   8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2 /tmp/monmap
> $ ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring
> /tmp/ceph.mon.keyring
> $ /etc/init.d/ceph start mon
> $ ceph osd lspools #= 0 data,1 metadata,2 rbd,
>
> storage node 1:
> $ apt-get install ceph
> $ rsync -a camel:/etc/ceph/ceph.conf /etc/ceph/
> $ rsync -a camel:/var/lib/ceph/bootstrap-osd/ceph.keyring \
>   /var/lib/ceph/bootstrap-osd/
> $ ceph-disk prepare --cluster ceph --cluster-uuid \
>   8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2 /dev/sdb /dev/sdc
>
> Output:
> cannot read partition index; assume it isn't present
>  (Error: Command '/sbin/parted' returned non-zero exit status 1)
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the
> same device as the osd data
> Creating new GPT entries.
> Information: Moved requested sector from 34 to 2048 in
> order to align on 2048-sector boundaries.
> The operation has completed successfully.
> Creating new GPT entries.
> Information: Moved requested sector from 34 to 2048 in
> order to align on 2048-sector boundaries.
> The operation has completed successfully.
> meta-data=/dev/sdb1  isize=2048   agcount=4,
> agsize=15262347
> blks
>  =   sectsz=512   attr=2, projid32bit=0
> data =   bsize=4096   blocks=61049385,
> imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0
> log  =internal log   bsize=4096   blocks=29809, version=2
>  =   sectsz=512   sunit=0 blks,
> lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> The operation has completed successfully.
>
> $ ceph-disk activate /dev/sdb1
> Hangs
>
> Looking at ps -efH I can see that ceph-disk launched:
> /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap
> /var/lib/ceph/tmp/mnt.ST6Kz_/activate.monmap --osd-data
> /var/lib/ceph/tmp/mnt.ST6Kz_ --osd-journal
> /var/lib/ceph/tmp/mnt.ST6Kz_/journal --osd-uuid
> 636f694a-3677-44f0-baaf-4d74195b1806 --keyring
> /var/lib/ceph/tmp/mnt.ST6Kz_/keyring
>
> /var/lib/ceph/tmp/mnt.ST6Kz_ contains:
> activate.monmap  current/ journal  

[ceph-users] SAS-Exp 9300-8i or Raid-Contr 9750-4i ?

2015-04-23 Thread Markus Goldberg

Hi,
I will upgrade my existing hardware (in 3 SC847 cases with 30 HDDs each) 
in the next few days.

Should I buy a SAS expander 9300-8i or keep my existing RAID controller 9750-4i?

The 9300 will give me real JBOD, the 9750 only single-disk RAIDs.

Does this make a real difference? Should I spend the money or keep it?

--
MfG,
  Markus Goldberg

--
Markus Goldberg   Universität Hildesheim
  Rechenzentrum
Tel +49 5121 88392822 Universitätsplatz 1, D-31141 Hildesheim, Germany
Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
--




Re: [ceph-users] ceph-fuse unable to run through "screen" ?

2015-04-23 Thread Burkhard Linke

Hi,

I had a similar problem during reboots. It was solved by adding 
'_netdev' to the options for the fstab entry. Otherwise the system may 
try to mount the cephfs mount point before the network is available.
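
(A sketch of the resulting fstab line; the mount point and id are purely examples:)

id=admin  /mnt/cephfs  fuse.ceph  defaults,_netdev  0 0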


This solution is for Ubuntu, YMMV.

Best regards,
Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810



Re: [ceph-users] OSD move after reboot

2015-04-23 Thread Antonio Messina
On Thu, Apr 23, 2015 at 11:18 AM, Jake Grimmett  wrote:
> Dear All,
>
> I have multiple disk types (15k & 7k) on each ceph node, which I assign to
> different pools, but have a problem as whenever I reboot a node, the OSD's
> move in the CRUSH map.

I just found out that you can customize the way OSDs are automatically
added to the crushmap using a hook script.

I have in ceph.conf:

osd crush location hook = /usr/local/sbin/sc-ceph-crush-location

this will return the correct bucket and root for the specific osd.

I also have

osd crush update on start = true

which should be the default.

This way, whenever an OSD starts, it's automatically added to the correct bucket.
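
(Not the actual hook, just a minimal sketch of what such a script can look like.
It is invoked with --cluster/--id/--type and must print a CRUSH location on
stdout; the id ranges and bucket names below are only illustrative:)

#!/bin/sh
# sketch of a crush location hook; ceph-osd calls it roughly as:
#   <hook> --cluster <name> --id <osd-id> --type osd
while [ $# -gt 0 ]; do
    case "$1" in
        --id) shift; id="$1" ;;
    esac
    shift
done
# illustrative policy: osd ids 32-47 are the 15k spinners
if [ "$id" -ge 32 ] && [ "$id" -le 47 ]; then
    echo "root=15k-disk host=$(hostname -s)-15k"
else
    echo "root=7k-disk host=$(hostname -s)"
fi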

ref: http://ceph.com/docs/master/rados/operations/crush-map/#crush-location

.a.

-- 
antonio.s.mess...@gmail.com
antonio.mess...@uzh.ch +41 (0)44 635 42 22
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland


Re: [ceph-users] OSD move after reboot

2015-04-23 Thread Burkhard Linke

Hi,

On 04/23/2015 11:18 AM, Jake Grimmett wrote:

Dear All,

I have multiple disk types (15k & 7k) on each ceph node, which I 
assign to different pools, but have a problem as whenever I reboot a 
node, the OSD's move in the CRUSH map.


i.e. on host ceph4, before a reboot I have this osd tree

-10  7.68980 root 15k-disk
(snip)
 -9  2.19995 host ceph4-15k

*snipsnap*

 -1 34.96852 root 7k-disk
(snip)
 -5  7.36891 host ceph4

*snipsnap*

After a reboot I have this:

-10  5.48985 root 15k-disk
 -6  2.19995 host ceph1-15k
 32  0.54999 osd.32 up  1.0 1.0
 33  0.54999 osd.33 up  1.0 1.0
 34  0.54999 osd.34 up  1.0 1.0
 35  0.54999 osd.35 up  1.0 1.0
 -70 host ceph2-15k
 -80 host ceph3-15k
 -90 host ceph4-15k
-1 37.16847 root 7k-disk
(snip)
 -5  9.56886 host ceph4

*snipsnap*



My current cludge, is to just put a series of "osd crush set" lines 
like this in rc.local:


ceph osd crush set osd.44 0.54999 root=15k-disk host=ceph4-15k

*snipsnap*

Upon reboot, the OSD updates its location in the crush tree by default. 
It uses the hostname of the box if no other information is given (output 
of 'hostname -s').


You can either disable updating the location altogether or define a custom 
location (either fixed or via a script). See the "CRUSH LOCATION" 
paragraph at http://docs.ceph.com/docs/master/rados/operations/crush-map/


Best regards,
Burkhard



Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-23 Thread Alexandre DERUMIER
Maybe it's tcmalloc related.
I thought I had patched it correctly, but perf shows a lot of 
tcmalloc::ThreadCache::ReleaseToCentralCache

before osd restart (100k)
--
  11.66%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache
   8.51%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::FetchFromSpans
   3.04%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::ReleaseToSpans
   2.04%ceph-osd  libtcmalloc.so.4.1.2  [.] operator new
   1.63% swapper  [kernel.kallsyms] [k] intel_idle
   1.35%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::ReleaseListToSpans
   1.33%ceph-osd  libtcmalloc.so.4.1.2  [.] operator delete
   1.07%ceph-osd  libstdc++.so.6.0.19   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string
   0.91%ceph-osd  libpthread-2.17.so[.] pthread_mutex_trylock
   0.88%ceph-osd  libc-2.17.so  [.] __memcpy_ssse3_back
   0.81%ceph-osd  ceph-osd  [.] Mutex::Lock
   0.79%ceph-osd  [kernel.kallsyms] [k] 
copy_user_enhanced_fast_string
   0.74%ceph-osd  libpthread-2.17.so[.] pthread_mutex_unlock
   0.67%ceph-osd  [kernel.kallsyms] [k] _raw_spin_lock
   0.63% swapper  [kernel.kallsyms] [k] native_write_msr_safe
   0.62%ceph-osd  [kernel.kallsyms] [k] avc_has_perm_noaudit
   0.58%ceph-osd  ceph-osd  [.] operator<
   0.57%ceph-osd  [kernel.kallsyms] [k] __schedule
   0.57%ceph-osd  [kernel.kallsyms] [k] __d_lookup_rcu
   0.54% swapper  [kernel.kallsyms] [k] __schedule


after osd restart (300k iops)
--
   3.47%  ceph-osd  libtcmalloc.so.4.1.2  [.] operator new
   1.92%  ceph-osd  libtcmalloc.so.4.1.2  [.] operator delete
   1.86%   swapper  [kernel.kallsyms] [k] intel_idle
   1.52%  ceph-osd  libstdc++.so.6.0.19   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string
   1.34%  ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache
   1.24%  ceph-osd  libc-2.17.so  [.] __memcpy_ssse3_back
   1.23%  ceph-osd  ceph-osd  [.] Mutex::Lock
   1.21%  ceph-osd  libpthread-2.17.so[.] pthread_mutex_trylock
   1.11%  ceph-osd  [kernel.kallsyms] [k] copy_user_enhanced_fast_string
   0.95%  ceph-osd  libpthread-2.17.so[.] pthread_mutex_unlock
   0.94%  ceph-osd  [kernel.kallsyms] [k] _raw_spin_lock
   0.78%  ceph-osd  [kernel.kallsyms] [k] __d_lookup_rcu
   0.70%  ceph-osd  [kernel.kallsyms] [k] tcp_sendmsg
   0.70%  ceph-osd  ceph-osd  [.] Message::Message
   0.68%  ceph-osd  [kernel.kallsyms] [k] __schedule
   0.66%  ceph-osd  [kernel.kallsyms] [k] idle_cpu
   0.65%  ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::FetchFromSpans
   0.64%   swapper  [kernel.kallsyms] [k] native_write_msr_safe
   0.61%  ceph-osd  ceph-osd  [.] 
std::tr1::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
   0.60%   swapper  [kernel.kallsyms] [k] __schedule
   0.60%  ceph-osd  libstdc++.so.6.0.19   [.] 0x000bdd2b
   0.57%  ceph-osd  ceph-osd  [.] operator<
   0.57%  ceph-osd  ceph-osd  [.] crc32_iscsi_00
   0.56%  ceph-osd  libstdc++.so.6.0.19   [.] std::string::_Rep::_M_dispose
   0.55%  ceph-osd  [kernel.kallsyms] [k] __switch_to
   0.54%  ceph-osd  libc-2.17.so  [.] vfprintf
   0.52%  ceph-osd  [kernel.kallsyms] [k] fget_light

- Mail original -
De: "aderumier" 
À: "Srinivasula Maram" 
Cc: "ceph-users" , "ceph-devel" 
, "Milosz Tanski" 
Envoyé: Jeudi 23 Avril 2015 10:00:34
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Hi, 
I'm hitting this bug again today. 

So don't seem to be numa related (I have try to flush linux buffer to be sure). 

and tcmalloc is patched (I don't known how to verify that it's ok). 

I don't have restarted osd yet. 

Maybe some perf trace could be usefulll ? 


- Mail original - 
De: "aderumier"  
À: "Srinivasula Maram"  
Cc: "ceph-users" , "ceph-devel" 
, "Milosz Tanski"  
Envoyé: Mercredi 22 Avril 2015 18:30:26 
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops 

Hi, 

>>I feel it is due to tcmalloc issue 

Indeed, I had patched one of my node, but not the other. 
So maybe I have hit this bug. (but I can't confirm, I don't have traces). 

But numa interleaving seem to help in my case (maybe not from 100->300k, but 
250k->300k). 

I need to do more long tests to confirm that. 


- Mail original - 
De: "Srinivasula Maram"  
À: "Mark Nelson" , "aderumier" , 
"Milosz Tanski"  
Cc: "ceph-devel" , "ceph-users" 
 
Envoyé: 

Re: [ceph-users] SAS-Exp 9300-8i or Raid-Contr 9750-4i ?

2015-04-23 Thread Weeks, Jacob (RIS-BCT)
The 9750-4i may support JBOD mode. If you would like to test, install the 
MegaCLI tools. (Warning, this will clear the RAID configuration on your device)

This is the command for switching the RAID controller to JBOD mode: 

#megacli -AdpSetProp -EnableJBOD -1 -aAll

Thanks,

Jacob

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Markus 
Goldberg
Sent: Thursday, April 23, 2015 5:41 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SAS-Exp 9300-8i or Raid-Contr 9750-4i ?

Hi,
i will upgrade my existing Hardware (in 3 SC847-cases with 30 HDDs each) the 
next days.
Should i buy a SAS-Expander 9300-8i or keep my existing Raid-Contr 9750-4i.

The 9300 will give me real Jbod , the 9750 only single-disk Raids.

Does this make a real difference ? Should i spend or keep money?

--
MfG,
   Markus Goldberg

--
Markus Goldberg   Universität Hildesheim
   Rechenzentrum
Tel +49 5121 88392822 Universitätsplatz 1, D-31141 Hildesheim, Germany Fax +49 
5121 88392823 email goldb...@uni-hildesheim.de
--





Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-23 Thread Mark Nelson

Thanks for the testing Alexandre!

If you have the means to compile the same version of ceph with jemalloc, 
I would be very interested to see how it does.


In some ways I'm glad it turned out not to be NUMA.  I still suspect we 
will have to deal with it at some point, but perhaps not today. ;)


Mark

On 04/23/2015 05:58 AM, Alexandre DERUMIER wrote:

Maybe it's tcmalloc related
I thinked to have patched it correctly, but perf show a lot of 
tcmalloc::ThreadCache::ReleaseToCentralCache

before osd restart (100k)
--
   11.66%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache
8.51%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::FetchFromSpans
3.04%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::ReleaseToSpans
2.04%ceph-osd  libtcmalloc.so.4.1.2  [.] operator new
1.63% swapper  [kernel.kallsyms] [k] intel_idle
1.35%ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::ReleaseListToSpans
1.33%ceph-osd  libtcmalloc.so.4.1.2  [.] operator delete
1.07%ceph-osd  libstdc++.so.6.0.19   [.] std::basic_string, std::allocator >::basic_string
0.91%ceph-osd  libpthread-2.17.so[.] pthread_mutex_trylock
0.88%ceph-osd  libc-2.17.so  [.] __memcpy_ssse3_back
0.81%ceph-osd  ceph-osd  [.] Mutex::Lock
0.79%ceph-osd  [kernel.kallsyms] [k] 
copy_user_enhanced_fast_string
0.74%ceph-osd  libpthread-2.17.so[.] pthread_mutex_unlock
0.67%ceph-osd  [kernel.kallsyms] [k] _raw_spin_lock
0.63% swapper  [kernel.kallsyms] [k] native_write_msr_safe
0.62%ceph-osd  [kernel.kallsyms] [k] avc_has_perm_noaudit
0.58%ceph-osd  ceph-osd  [.] operator<
0.57%ceph-osd  [kernel.kallsyms] [k] __schedule
0.57%ceph-osd  [kernel.kallsyms] [k] __d_lookup_rcu
0.54% swapper  [kernel.kallsyms] [k] __schedule


after osd restart (300k iops)
--
3.47%  ceph-osd  libtcmalloc.so.4.1.2  [.] operator new
1.92%  ceph-osd  libtcmalloc.so.4.1.2  [.] operator delete
1.86%   swapper  [kernel.kallsyms] [k] intel_idle
1.52%  ceph-osd  libstdc++.so.6.0.19   [.] std::basic_string, std::allocator >::basic_string
1.34%  ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache
1.24%  ceph-osd  libc-2.17.so  [.] __memcpy_ssse3_back
1.23%  ceph-osd  ceph-osd  [.] Mutex::Lock
1.21%  ceph-osd  libpthread-2.17.so[.] pthread_mutex_trylock
1.11%  ceph-osd  [kernel.kallsyms] [k] 
copy_user_enhanced_fast_string
0.95%  ceph-osd  libpthread-2.17.so[.] pthread_mutex_unlock
0.94%  ceph-osd  [kernel.kallsyms] [k] _raw_spin_lock
0.78%  ceph-osd  [kernel.kallsyms] [k] __d_lookup_rcu
0.70%  ceph-osd  [kernel.kallsyms] [k] tcp_sendmsg
0.70%  ceph-osd  ceph-osd  [.] Message::Message
0.68%  ceph-osd  [kernel.kallsyms] [k] __schedule
0.66%  ceph-osd  [kernel.kallsyms] [k] idle_cpu
0.65%  ceph-osd  libtcmalloc.so.4.1.2  [.] 
tcmalloc::CentralFreeList::FetchFromSpans
0.64%   swapper  [kernel.kallsyms] [k] native_write_msr_safe
0.61%  ceph-osd  ceph-osd  [.] 
std::tr1::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
0.60%   swapper  [kernel.kallsyms] [k] __schedule
0.60%  ceph-osd  libstdc++.so.6.0.19   [.] 0x000bdd2b
0.57%  ceph-osd  ceph-osd  [.] operator<
0.57%  ceph-osd  ceph-osd  [.] crc32_iscsi_00
0.56%  ceph-osd  libstdc++.so.6.0.19   [.] std::string::_Rep::_M_dispose
0.55%  ceph-osd  [kernel.kallsyms] [k] __switch_to
0.54%  ceph-osd  libc-2.17.so  [.] vfprintf
0.52%  ceph-osd  [kernel.kallsyms] [k] fget_light

- Mail original -
De: "aderumier" 
À: "Srinivasula Maram" 
Cc: "ceph-users" , "ceph-devel" , 
"Milosz Tanski" 
Envoyé: Jeudi 23 Avril 2015 10:00:34
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Hi,
I'm hitting this bug again today.

So don't seem to be numa related (I have try to flush linux buffer to be sure).

and tcmalloc is patched (I don't known how to verify that it's ok).

I don't have restarted osd yet.

Maybe some perf trace could be usefulll ?


- Mail original -
De: "aderumier" 
À: "Srinivasula Maram" 
Cc: "ceph-users" , "ceph-devel" , 
"Milosz Tanski" 
Envoyé: Mercredi 22 Avril 2015 18:30:26
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Hi,


I feel it is due to tcmalloc issue


Indeed, I had patched one of my 

[ceph-users] how to disable the warning log"Disabling LTTng-UST per-user tracing. "?

2015-04-23 Thread x...@csvcn.cn
hi all,

After I upgraded Ceph to 0.94.1 it prints the following warning every time I 
restart ceph-osd. Is there a way to disable this log message?


libust[19291/19291]: Warning: HOME environment variable not set. Disabling 
LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)


Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-23 Thread Alexandre DERUMIER
>>If you have the means to compile the same version of ceph with jemalloc, 
>>I would be very interested to see how it does.

Yes, sure. (I have around 3-4 weeks to do all the benchmarks.)

But I don't know how to do it.
I'm running the cluster on CentOS 7.1; maybe it would be easy to patch the SRPMs 
to rebuild the package with jemalloc.
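
(Short of a rebuild, one crude way to A/B-test the allocator is to interpose
jemalloc via the dynamic linker. This assumes jemalloc is installed, e.g. from
EPEL at /usr/lib64/libjemalloc.so.1, and that ceph-osd links tcmalloc
dynamically; it is not equivalent to a build that links jemalloc directly:)

# run a single OSD in the foreground with jemalloc interposed (experiment only)
LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/bin/ceph-osd -i 0 -f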



- Mail original -
De: "Mark Nelson" 
À: "aderumier" , "Srinivasula Maram" 

Cc: "ceph-users" , "ceph-devel" 
, "Milosz Tanski" 
Envoyé: Jeudi 23 Avril 2015 13:33:00
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Thanks for the testing Alexandre! 

If you have the means to compile the same version of ceph with jemalloc, 
I would be very interested to see how it does. 

In some ways I'm glad it turned out not to be NUMA. I still suspect we 
will have to deal with it at some point, but perhaps not today. ;) 

Mark 

On 04/23/2015 05:58 AM, Alexandre DERUMIER wrote: 
> Maybe it's tcmalloc related 
> I thinked to have patched it correctly, but perf show a lot of 
> tcmalloc::ThreadCache::ReleaseToCentralCache 
> 
> before osd restart (100k) 
> -- 
> 11.66% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::ThreadCache::ReleaseToCentralCache 
> 8.51% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::FetchFromSpans 
> 3.04% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::ReleaseToSpans 
> 2.04% ceph-osd libtcmalloc.so.4.1.2 [.] operator new 
> 1.63% swapper [kernel.kallsyms] [k] intel_idle 
> 1.35% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::ReleaseListToSpans 
> 1.33% ceph-osd libtcmalloc.so.4.1.2 [.] operator delete 
> 1.07% ceph-osd libstdc++.so.6.0.19 [.] std::basic_string std::char_traits, std::allocator >::basic_string 
> 0.91% ceph-osd libpthread-2.17.so [.] pthread_mutex_trylock 
> 0.88% ceph-osd libc-2.17.so [.] __memcpy_ssse3_back 
> 0.81% ceph-osd ceph-osd [.] Mutex::Lock 
> 0.79% ceph-osd [kernel.kallsyms] [k] copy_user_enhanced_fast_string 
> 0.74% ceph-osd libpthread-2.17.so [.] pthread_mutex_unlock 
> 0.67% ceph-osd [kernel.kallsyms] [k] _raw_spin_lock 
> 0.63% swapper [kernel.kallsyms] [k] native_write_msr_safe 
> 0.62% ceph-osd [kernel.kallsyms] [k] avc_has_perm_noaudit 
> 0.58% ceph-osd ceph-osd [.] operator< 
> 0.57% ceph-osd [kernel.kallsyms] [k] __schedule 
> 0.57% ceph-osd [kernel.kallsyms] [k] __d_lookup_rcu 
> 0.54% swapper [kernel.kallsyms] [k] __schedule 
> 
> 
> after osd restart (300k iops) 
> -- 
> 3.47% ceph-osd libtcmalloc.so.4.1.2 [.] operator new 
> 1.92% ceph-osd libtcmalloc.so.4.1.2 [.] operator delete 
> 1.86% swapper [kernel.kallsyms] [k] intel_idle 
> 1.52% ceph-osd libstdc++.so.6.0.19 [.] std::basic_string std::char_traits, std::allocator >::basic_string 
> 1.34% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::ThreadCache::ReleaseToCentralCache 
> 1.24% ceph-osd libc-2.17.so [.] __memcpy_ssse3_back 
> 1.23% ceph-osd ceph-osd [.] Mutex::Lock 
> 1.21% ceph-osd libpthread-2.17.so [.] pthread_mutex_trylock 
> 1.11% ceph-osd [kernel.kallsyms] [k] copy_user_enhanced_fast_string 
> 0.95% ceph-osd libpthread-2.17.so [.] pthread_mutex_unlock 
> 0.94% ceph-osd [kernel.kallsyms] [k] _raw_spin_lock 
> 0.78% ceph-osd [kernel.kallsyms] [k] __d_lookup_rcu 
> 0.70% ceph-osd [kernel.kallsyms] [k] tcp_sendmsg 
> 0.70% ceph-osd ceph-osd [.] Message::Message 
> 0.68% ceph-osd [kernel.kallsyms] [k] __schedule 
> 0.66% ceph-osd [kernel.kallsyms] [k] idle_cpu 
> 0.65% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::FetchFromSpans 
> 0.64% swapper [kernel.kallsyms] [k] native_write_msr_safe 
> 0.61% ceph-osd ceph-osd [.] 
> std::tr1::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
> 0.60% swapper [kernel.kallsyms] [k] __schedule 
> 0.60% ceph-osd libstdc++.so.6.0.19 [.] 0x000bdd2b 
> 0.57% ceph-osd ceph-osd [.] operator< 
> 0.57% ceph-osd ceph-osd [.] crc32_iscsi_00 
> 0.56% ceph-osd libstdc++.so.6.0.19 [.] std::string::_Rep::_M_dispose 
> 0.55% ceph-osd [kernel.kallsyms] [k] __switch_to 
> 0.54% ceph-osd libc-2.17.so [.] vfprintf 
> 0.52% ceph-osd [kernel.kallsyms] [k] fget_light 
> 
> - Mail original - 
> De: "aderumier"  
> À: "Srinivasula Maram"  
> Cc: "ceph-users" , "ceph-devel" 
> , "Milosz Tanski"  
> Envoyé: Jeudi 23 Avril 2015 10:00:34 
> Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
> improve performance from 100k iops to 300k iops 
> 
> Hi, 
> I'm hitting this bug again today. 
> 
> So don't seem to be numa related (I have try to flush linux buffer to be 
> sure). 
> 
> and tcmalloc is patched (I don't known how to verify that it's ok). 
> 
> I don't have restarted osd yet. 
> 
> Maybe some perf trace could be usefulll ? 
> 
> 
> - Mail original - 
> De: "aderumier"  
> À: "Srinivasula Maram"  
> Cc: "ceph-users" , "ceph-devel" 
> , "Milosz Tanski"  
> Envoyé: Mercredi 22 Avril 2015 18:30:26 
> Objet: Re: [

Re: [ceph-users] Powering down a ceph cluster

2015-04-23 Thread 10 minus
Thanks Wido ...

It worked.
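
For the record, the sequence boiled down to something like the sketch below.
(The noout flag and the sysvinit service targets are my own additions and
assumptions, not part of Wido's instructions.)

# stop client I/O first, then optionally keep CRUSH from rebalancing:
ceph osd set noout

# on each OSD node, then on each monitor node:
service ceph stop osd
service ceph stop mon

# power everything off; on power-up, reverse the order:
service ceph start mon      # monitor nodes first
service ceph start osd      # then the OSD nodes
ceph osd unset noout        # once everything is back up and healthy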



On Wed, Apr 22, 2015 at 5:33 PM, Wido den Hollander  wrote:

>
>
> > Op 22 apr. 2015 om 16:54 heeft 10 minus  het
> volgende geschreven:
> >
> > Hi,
> >
> > Is there a recommended way of powering down a ceph cluster and bringing
> it back up ?
> >
> > I have looked  thru the docs and cannot find anything wrt it.
> >
>
> Best way would be:
> - Stop all client I/O
> - Shut down the OSDs
> - Shut down the monitors
>
> Afterwards, boot the monitors first, then the OSDs.
>
> Wido
>
> >
> > Thanks in advance
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster not coming up after reboot

2015-04-23 Thread Kenneth Waegeman



On 04/22/2015 07:35 PM, Gregory Farnum wrote:

On Wed, Apr 22, 2015 at 8:17 AM, Kenneth Waegeman
 wrote:

Hi,

I changed the cluster network parameter in the config files, restarted the
monitors , and then restarted all the OSDs (shouldn't have done that).


Do you mean that you changed the IP addresses of the monitors in the
config files everywhere, and then tried to restart things? Or
something else?
I only changed the value of the cluster network to a different one than 
the public network.



Now
the OSDS keep on crashing, and the cluster is not able to restore.. I
eventually rebooted the whole cluster, but the problem remains: For a moment
all 280 OSDs are up, and then they start crashing rapidly until there are
only less than 100 left (and eventually 30 or so).


Are the OSDs actually crashing, or are they getting shut down? If
they're crashing, can you please provide the actual backtrace? The
logs you're including below are all fairly low level and generally
don't even mean something has to be wrong.


It seems I did not test the network thoroughly enough: there was one 
host that was unable to connect to the cluster network, only to the public 
network. I found this out after all OSDs except those of that host came up 
after a few hours. I fixed the network issue and all was fine (only a 
few peering problems, but restarting the blocking OSDs was sufficient).
There were no backtraces, and indeed I found out there were some 
shutdown messages in the logs.


So it is all fixed now, but can it be explained why at first about 90% of 
the OSDs went into shutdown over and over, and only reached a stable 
situation after some time, all because of a network failure on a single host?


Thanks again!




In the log files I see different kind of messages: Some OSDs have:

I tested the network, the hosts can reach one another on both networks..


What configurations did you test?
14 hosts with 16 keyvalue OSDs each, plus 2 replicated cache partitions and 
metadata partitions on 2 SSDs for CephFS.



-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling btrfs snapshots for existing OSDs

2015-04-23 Thread Krzysztof Nowicki
I have done this via a restart of the OSDs after adding the configuration
option in ceph.conf. It works fine. My Ceph version is 0.80.5.

One thing worth noting is that you'll sooner or later want to remove stale
snap_* subvolumes, as leaving them will cause a slow increase of disk usage
on your OSD filesystem: the snapshots will hold references to extents
holding data that has since been changed.
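
A rough sketch of both steps (the OSD id, mount path and the runtime injection
are examples/assumptions on my side, not something I have verified on every
version):

# ceph.conf, [osd] section:
#   filestore btrfs snap = false
# then restart the OSD, or try injecting it at runtime:
ceph tell osd.0 injectargs '--filestore_btrfs_snap false'

# listing and deleting the stale snapshots left behind:
btrfs subvolume list /var/lib/ceph/osd/ceph-0 | grep snap_
btrfs subvolume delete /var/lib/ceph/osd/ceph-0/snap_123456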

Regards
Chris

czw., 23.04.2015 o 09:30 użytkownik Christian Balzer 
napisał:

>
> Hello,
>
> On Thu, 23 Apr 2015 09:10:13 +0200 Burkhard Linke wrote:
>
> > Hi,
> >
> > I have a small number of OSDs running Ubuntu Trusty 14.04 and Ceph
> > Firefly 0.80.9. Due to stability issues I would like to disable the
> > btrfs snapshot feature (filestore btrfs snap = false).
> >
> > Is it possible to apply this change to an existing OSD (stop OSD, change
> > config, restart OSD), or do I need to recreate the OSD from scratch?
> >
> While I don't know if you can change this mid-race so to speak (but I
> would assume yes, as it should affect only new snapshots), what I do know
> is that in all likelihood you won't need to stop the OSD to apply the
> change.
> As in, use the admin socket interface to inject the new setting into the
> respective OSD.
> Keeping ceph.conf up to date (if only for reference) is of course helpful,
> too.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk activate hangs with external journal device

2015-04-23 Thread Robert LeBlanc
Sorry, reading too fast. That key isn't from a previous attempt, correct?
But I doubt that is the problem as you would receive an access denied
message in the logs.

Try running ceph-disk zap and recreating the OSD. Also remove the auth key
and the OSD (ceph osd rm), then do a ceph-disk prepare. I don't think
the first start-up should be trying to create file systems; that should have
been done with prepare.
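
Roughly (reusing the fsid and devices from your mail; treat it as a sketch):

ceph-disk zap /dev/sdb                      # wipe the old partition table/data
ceph auth del osd.0                         # drop the stale auth key, if any
ceph osd rm 0                               # remove the OSD id from the map
ceph-disk prepare --cluster ceph --cluster-uuid \
  8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2 /dev/sdb /dev/sdc
ceph-disk activate /dev/sdb1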

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Apr 23, 2015 3:43 AM, "Daniel Piddock"  wrote:

>  On 22/04/15 20:32, Robert LeBlanc wrote:
>
> I believe your problem is that you haven't created bootstrap-osd key and
> distributed it to your OSD node in /var/lib/ceph/bootstrap-osd/.
>
>
> Hi Robert,
>
> Thank you for your reply.
>
> In my original post, steps performed, I did include copying over the
> bootstrap-osd key. Also "ceph-disk activate" fails with an obvious error
> when that file is missing:
>
> 2015-04-23 10:16:47.245951 7fccc5a9c700 -1 monclient(hunting): ERROR:
> missing keyring, cannot use cephx for authentication
> 2015-04-23 10:16:47.245955 7fccc5a9c700  0 librados: client.bootstrap-osd
> initialization error (2) No such file or directory
> Error connecting to cluster: ObjectNotFound
> ERROR:ceph-disk:Failed to activate
> ceph-disk: Error: ceph osd create failed: Command '/usr/bin/ceph' returned
> non-zero exit status 1:
>
> This is not the source of my issue.
>
> Dan
>
>
>
> On Wed, Apr 22, 2015 at 5:41 AM, Daniel Piddock  > wrote:
>
>> Hi,
>>
>> I'm a ceph newbie setting up some trial installs for evaluation.
>>
>> Using Debian stable (Wheezy) with Ceph Firefly from backports
>> (0.80.7-1~bpo70+1).
>>
>> I've been following the instructions at
>> http://docs.ceph.com/docs/firefly/install/manual-deployment/ and first
>> time through went well, using a partition on the same drive as the OS. I
>> then migrated to having data on separate harddrives and that worked too.
>>
>> I'm currently trying to get an OSD set up with the journal on an SSD
>> partition that's separate from the data drive. ceph-disk is not playing
>> ball and I've been getting various forms of failure. My greatest success
>> was getting the OSD created but it would never go "up". I'm struggling
>> to find anything useful in the logs or really what to look for.
>>
>> I purged the ceph package and wiped the storage drives to give me a
>> blank slate and tried again.
>>
>> Steps performed:
>>
>> camel (MON server):
>> $ apt-get install ceph
>> $ uuidgen #= 8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2
>> # created /etc/ceph/ceph.conf, attached
>> $ ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. \
>>   --cap mon 'allow *'
>> $ ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
>> --gen-key \
>>   -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' \
>>   --cap mds 'allow'
>> $ ceph-authtool /tmp/ceph.mon.keyring --import-keyring \
>>   /etc/ceph/ceph.client.admin.keyring
>> $ monmaptool --create --add a 10.1.0.3 --fsid \
>>   8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2 /tmp/monmap
>> $ ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring
>> /tmp/ceph.mon.keyring
>> $ /etc/init.d/ceph start mon
>> $ ceph osd lspools #= 0 data,1 metadata,2 rbd,
>>
>> storage node 1:
>> $ apt-get install ceph
>> $ rsync -a camel:/etc/ceph/ceph.conf /etc/ceph/
>> $ rsync -a camel:/var/lib/ceph/bootstrap-osd/ceph.keyring \
>>   /var/lib/ceph/bootstrap-osd/
>> $ ceph-disk prepare --cluster ceph --cluster-uuid \
>>   8c9ff7b5-904a-4f9a-8c9e-d2f8b05b55d2 /dev/sdb /dev/sdc
>>
>> Output:
>> cannot read partition index; assume it isn't present
>>  (Error: Command '/sbin/parted' returned non-zero exit status 1)
>> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the
>> same device as the osd data
>> Creating new GPT entries.
>> Information: Moved requested sector from 34 to 2048 in
>> order to align on 2048-sector boundaries.
>> The operation has completed successfully.
>> Creating new GPT entries.
>> Information: Moved requested sector from 34 to 2048 in
>> order to align on 2048-sector boundaries.
>> The operation has completed successfully.
>> meta-data=/dev/sdb1  isize=2048   agcount=4, agsize=15262347
>> blks
>>  =   sectsz=512   attr=2, projid32bit=0
>> data =   bsize=4096   blocks=61049385, imaxpct=25
>>  =   sunit=0  swidth=0 blks
>> naming   =version 2  bsize=4096   ascii-ci=0
>> log  =internal log   bsize=4096   blocks=29809, version=2
>>  =   sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> The operation has completed successfully.
>>
>> $ ceph-disk activate /dev/sdb1
>> Hangs
>>
>> Looking at ps -efH I can see that ceph-disk launched:
>> /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap
>> /var/lib/ceph/tmp/mnt.ST6Kz_/activate.monmap --osd-data

Re: [ceph-users] long blocking with writes on rbds

2015-04-23 Thread Jeff Epstein
The appearance of these socket closed messages seems to coincide with 
the slowdown symptoms. What is the cause?


2015-04-23T14:08:47.111838+00:00 i-65062482 kernel: [ 4229.485489] libceph: 
osd1 192.168.160.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:06.961823+00:00 i-65062482 kernel: [ 4249.332547] libceph: 
osd2 192.168.96.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:09.701819+00:00 i-65062482 kernel: [ 4252.070594] libceph: 
osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:10.381817+00:00 i-65062482 kernel: [ 4252.755400] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:14.831817+00:00 i-65062482 kernel: [ 4257.200257] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:57.061877+00:00 i-65062482 kernel: [ 4539.431624] libceph: 
osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:57.541842+00:00 i-65062482 kernel: [ 4539.913284] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:59.801822+00:00 i-65062482 kernel: [ 4542.177187] libceph: 
osd3 192.168.0.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:11.361819+00:00 i-65062482 kernel: [ 4553.733566] libceph: 
osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:47.871829+00:00 i-65062482 kernel: [ 4590.242136] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:47.991826+00:00 i-65062482 kernel: [ 4590.364078] libceph: 
osd2 192.168.96.4:6800 socket closed (con state OPEN)

2015-04-23T14:15:00.081817+00:00 i-65062482 kernel: [ 4602.452980] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:16:21.301820+00:00 i-65062482 kernel: [ 4683.671614] libceph: 
osd5 192.168.128.4:6800 socket closed (con state OPEN)



Jeff

On 04/23/2015 12:26 AM, Jeff Epstein wrote:



Do you have some idea how I can diagnose this problem?


I'll look at ceph -s output while you get these stuck process to see 
if there's any unusual activity (scrub/deep 
scrub/recovery/bacfills/...). Is it correlated in any way with rbd 
removal (ie: write blocking don't appear unless you removed at least 
one rbd for say one hour before the write performance problems).


I'm not familiar with Amazon VMs. If you map the rbds using the 
kernel driver to local block devices do you have control over the 
kernel you run (I've seen reports of various problems with older 
kernels and you probably want the latest possible) ?


ceph status shows nothing unusual. However, on the problematic node, 
we typically see entries in ps like this:


 1468 12329 root D 0.0 mkfs.ext4   wait_on_page_bit
 1468 12332 root D 0.0 mkfs.ext4   wait_on_buffer

Notice the "D" blocking state. Here, mkfs is stopped on some wait 
functions for long periods of time. (Also, we are formatting the RBDs 
as ext4 even though the OSDs are xfs; I assume this shouldn't be a 
problem?)


We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated 
kernel driver isn't out of the question; if anyone has any concrete 
information, I'd be grateful.


Jeff


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] long blocking with writes on rbds

2015-04-23 Thread Nick Fisk
Hi Jeff,

I believe these are normal; they are just idle connections to the OSDs timing
out because no traffic has flowed recently. They are probably a
symptom rather than a cause.
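
One way to double-check that (a sketch; the osd id is just an example, and the
admin socket command has to be run on the host carrying that OSD):

ceph health detail                     # shows any "N requests are blocked" entries
ceph daemon osd.1 dump_historic_ops    # recent slowest ops on that OSD, with timings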

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jeff Epstein
> Sent: 23 April 2015 15:19
> To: Lionel Bouton; Christian Balzer
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] long blocking with writes on rbds
> 
> The appearance of these socket closed messages seems to coincide with the
> slowdown symptoms. What is the cause?
> 
> 2015-04-23T14:08:47.111838+00:00 i-65062482 kernel: [ 4229.485489]
libceph:
> osd1 192.168.160.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:09:06.961823+00:00 i-65062482 kernel: [ 4249.332547]
libceph:
> osd2 192.168.96.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:09:09.701819+00:00 i-65062482 kernel: [ 4252.070594]
libceph:
> osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:09:10.381817+00:00 i-65062482 kernel: [ 4252.755400]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:09:14.831817+00:00 i-65062482 kernel: [ 4257.200257]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:13:57.061877+00:00 i-65062482 kernel: [ 4539.431624]
libceph:
> osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:13:57.541842+00:00 i-65062482 kernel: [ 4539.913284]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:13:59.801822+00:00 i-65062482 kernel: [ 4542.177187]
libceph:
> osd3 192.168.0.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:14:11.361819+00:00 i-65062482 kernel: [ 4553.733566]
libceph:
> osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:14:47.871829+00:00 i-65062482 kernel: [ 4590.242136]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:14:47.991826+00:00 i-65062482 kernel: [ 4590.364078]
libceph:
> osd2 192.168.96.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:15:00.081817+00:00 i-65062482 kernel: [ 4602.452980]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 2015-04-23T14:16:21.301820+00:00 i-65062482 kernel: [ 4683.671614]
libceph:
> osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 
> 
> 
> Jeff
> 
> On 04/23/2015 12:26 AM, Jeff Epstein wrote:
> >
>  Do you have some idea how I can diagnose this problem?
> >>>
> >>> I'll look at ceph -s output while you get these stuck process to see
> >>> if there's any unusual activity (scrub/deep
> >>> scrub/recovery/bacfills/...). Is it correlated in any way with rbd
> >>> removal (ie: write blocking don't appear unless you removed at least
> >>> one rbd for say one hour before the write performance problems).
> >>
> >> I'm not familiar with Amazon VMs. If you map the rbds using the
> >> kernel driver to local block devices do you have control over the
> >> kernel you run (I've seen reports of various problems with older
> >> kernels and you probably want the latest possible) ?
> >
> > ceph status shows nothing unusual. However, on the problematic node,
> > we typically see entries in ps like this:
> >
> >  1468 12329 root D 0.0 mkfs.ext4   wait_on_page_bit
> >  1468 12332 root D 0.0 mkfs.ext4   wait_on_buffer
> >
> > Notice the "D" blocking state. Here, mkfs is stopped on some wait
> > functions for long periods of time. (Also, we are formatting the RBDs
> > as ext4 even though the OSDs are xfs; I assume this shouldn't be a
> > problem?)
> >
> > We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated
> > kernel driver isn't out of the question; if anyone has any concrete
> > information, I'd be grateful.
> >
> > Jeff
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] read performance VS network usage

2015-04-23 Thread SCHAER Frederic
Hi again,

On my testbed, I have 5 Ceph nodes, each containing 23 OSDs (2TB btrfs drives). 
For these tests, I've set up a RAID0 on the 23 disks.
For now, I'm not using SSDs, as I discovered my vendor apparently decreased 
their performance on purpose...

So : 5 server nodes of which 3 are MONS too.
I also have 5 clients.
All of them have a single 10G NIC,  I'm not using a private network.
I'm testing EC pools, with the failure domain set to hosts.
The EC pool k/m is set to k=4/m=1
I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64)

And... I just found out I had "limited" read performance.
While I was watching the stats using dstat on one server node, I noticed that 
during the rados (read) bench, all the server nodes sent about 370MiB/s on the 
network, which is the average speed I get per server, but they also all 
received about 750-800MiB/s on that same network. And 800MB/s is about as much 
as you can get on a 10G link...

I'm trying to understand why I see this inbound data flow ?

-  Why does a server node receive data at all during a read bench ?

-  Why is it about twice as much as the data the node is sending ?

-  Is this about verifying data integrity at read time ?

I'm alone on the cluster, it's not used anywhere else.
I will try tomorrow to see if adding a 2nd 10G port (with a private network 
this time) improves the performance, but I'm really curious here to understand 
what the bottleneck is and what Ceph is doing...

Looking at the write performance, I see the same kind of behavior : nodes send 
about half the amount of data they receive (600MB/300MB), but this might be 
because this time the client only sends the real data and the erasure coding 
happens behind the scenes (or not ?)

Any idea ?

Regards
Frederic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing a ceph fs

2015-04-23 Thread Kenneth Waegeman



On 04/22/2015 06:51 PM, Gregory Farnum wrote:

If you look at the "ceph --help" output you'll find some commands for
removing MDSes from the system.


Yes, this works for all but the last mds..

[root@mds01 ~]# ceph mds rm 35632 mds.mds03
Error EBUSY: cannot remove active mds.mds03 rank 0

I stopped the daemon, checked that the process was stopped, and even did a 
shutdown of that MDS server, but I keep getting this message and am unable 
to remove the fs.


log file has this:

2015-04-23 16:14:05.171450 7fa9fe799700 -1 mds.0.4 *** got signal 
Terminated ***
2015-04-23 16:14:05.171490 7fa9fe799700  1 mds.0.4 suicide.  wanted 
down:dne, now up:active





-Greg
On Wed, Apr 22, 2015 at 6:46 AM Kenneth Waegeman
mailto:kenneth.waege...@ugent.be>> wrote:

forgot to mention I'm running 0.94.1

On 04/22/2015 03:02 PM, Kenneth Waegeman wrote:
 > Hi,
 >
 > I tried to recreate a ceph fs ( well actually an underlying pool, but
 > for that I need to first remove the fs) , but this seems not that
easy
 > to achieve.
 >
 > When I run
 > `ceph fs rm ceph_fs`
 > I get:
 > `Error EINVAL: all MDS daemons must be inactive before removing
filesystem`
 >
 > I stopped the 3 MDSs, but this doesn't change anything, as ceph
health
 > still "thinks" there is an mds running laggy:
 >
 >   health HEALTH_WARN
 >  mds cluster is degraded
 >  mds mds03 is laggy
 >   monmap e1: 3 mons at ...
 >  election epoch 12, quorum 0,1,2 mds01,mds02,mds03
 >   mdsmap e12: 1/1/1 up {0=mds03=up:replay(laggy or crashed)}
 >
 > I checked the mds processes are gone..
 >
 > Someone knows a solution for this?
 >
 > Thanks!
 > Kenneth
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com 
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accidentally Remove OSDs

2015-04-23 Thread Robert LeBlanc
A full CRUSH dump would be helpful, as well as knowing which OSDs you took
out. If you didn't take 17 out as well as 15, then you might be OK. If the
OSDs still show up in your CRUSH map, then try to remove them from the CRUSH
map with 'ceph osd crush rm osd.15'.

If you took out both OSDs, you will need to use some of the recovery tools.
I believe the procedure is roughly, mount the drive in another box, extract
the PGs needed, then shut down the primary OSD for that PG, inject the PG
into the OSD, then start it up and it should replicate. I haven't done it
myself (probably something I should do in case I ever run into the problem).
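
Very roughly, something like the sketch below (tool name, paths and ids are
assumptions based on your osd.15 / pg 17.a2 example; I have not run this myself):

# on the box where the old osd.15 disk is mounted:
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-15 \
  --journal-path /var/lib/ceph/osd/ceph-15/journal \
  --op export --pgid 17.a2 --file /tmp/pg17.a2.export

# on the current primary for that PG (stop that OSD first), then restart it:
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-27 \
  --journal-path /var/lib/ceph/osd/ceph-27/journal \
  --op import --file /tmp/pg17.a2.export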

On Thu, Apr 23, 2015 at 2:00 AM, FaHui Lin  wrote:

>  Dear Ceph experts,
>
> I'm a very new Ceph user. I made the blunder of removing some OSDs (and
> all files in the related directories) before Ceph finished rebalancing
> data and migrating PGs.
>
> Not to mention the data loss, I meet the problem that:
>
> 1) There are always stale pgs showing in ceph status (with heath warning).
> Say one of the stale pg 17.a2:
>
> # ceph -v
> ceph version *0.87.1* (283c2e7cfa2457799f534744d7d549f83ea1335e)
>
> # ceph -s
> cluster 3f81b47e-fb15-4fbb-9fee-0b1986dfd7ea
>  health HEALTH_WARN 203 pgs degraded; 366 pgs stale; 203 pgs stuck
> degraded; *366 pgs stuck stale*; 203 pgs stuck unclean; 203 pgs stuck
> undersized; 203 pgs undersized; 154 requests are blocked > 32 sec; recovery
> 153738/18991802 objects degraded (0.809%)
>  monmap e1: 1 mons at {...=...:6789/0}, election epoch 1, quorum 0
> tw-ceph01
>  osdmap e3697: 12 osds: 12 up, 12 in
>   pgmap v21296531: 1156 pgs, 18 pools, 36929 GB data, 9273 kobjects
> 72068 GB used, 409 TB / 480 TB avail
> 153738/18991802 objects degraded (0.809%)
>  163 stale+active+clean
>  786 active+clean
>  203 stale+active+undersized+degraded
>4 active+clean+scrubbing+deep
>
>
> # ceph pg dump_stuck stale | grep 17.a2
> 17.a2   0   0   0   0   0   0   0   0
> stale+active+clean  2015-04-20 09:16:11.624952 0'0
> 2718:200[15,17] 15  [15,17] 15  0'0 2015-04-15
> 10:42:37.8806990'0  2015-04-15 10:42:37.880699
>
> # ceph pg repair 17.a2
> Error EAGAIN: pg 17.a2 primary osd.15 not up
>
> # ceph pg scrub 17.a2
> Error EAGAIN: pg 17.a2 primary osd.15 not up
>
> # ceph pg map 17.a2
> osdmap e3695 pg 17.a2 (17.a2) -> up [27,3] acting [27,3]
>
>
> where osd.15 had already been removed. It seems to map to the existing
> OSDs ([27, 3]).
> Can this pg finally get recovered by changing to the existing OSDs? If
> not, how can I do about this kind of stale pg?
>
> 2) I tried to solve the problem above by creating the OSDs again, but failed.
> The reason was that I cannot create an OSD with the same ID as the one I removed,
> say osd.15 (or change the ID of an OSD).
> Is there any way to change the ID of an OSD? (By the way, I'm surprised
> that this issue can hardly be found on the internet.)
>
> 3) I tried another thing: to dump the crushmap and remove everything
> (including devices and buckets sections) related to the OSDs I removed.
> However, after I set the crushmap and dumped it out again, I found the
> OSDs's line still appear in the devices section (not in the buckets section
> though), such as:
>
> # devices
> device 0 osd.0
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> *device 5 device5*
> *...*
> *device 14 device14*
> *device 15 device15*
>
>
> Is there any way to remove them? Does it matter when I want to add new
> OSDs?
>
> Please inform me if you have any comments. Thank you.
>
> Best Regards,
> FaHui
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] systemd unit files and multiple daemons

2015-04-23 Thread Sage Weil
On Thu, 23 Apr 2015, HEWLETT, Paul (Paul)** CTR ** wrote:
> What about running multiple clusters on the same host?
> 
> There is a separate mail thread about being able to run clusters with 
> different conf files on the same host.
> Will the new systemd service scripts cope with this?

As currently planned, no.  Unfortunately systemd only allows a single 
substitution/id for identifying a daemon instance.  If we try to use 
that for both cluster and (osd/mon) id (e.g., ceph-1, mycluster-232) 
it gets ugly because we can't separate them into different fields.  
The current plan is for the cluster name to be specified in 
/etc/sysconfig/ceph or similar.

I'm hoping anyone who really needs multiple clusters on the same host can 
accomplish that using containers... would that cover your use case?
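
To illustrate the direction (just a sketch, nothing final):

# /etc/sysconfig/ceph (or /etc/default/ceph on Debian-based distros)
CLUSTER=ceph

# ceph-osd@.service would then only need the daemon id:
#   [Service]
#   EnvironmentFile=-/etc/sysconfig/ceph
#   ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i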

sage


> 
> Paul Hewlett
> Senior Systems Engineer
> Velocix, Cambridge
> Alcatel-Lucent
> t: +44 1223 435893
> 
> 
> 
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Gregory 
> Farnum [g...@gregs42.com]
> Sent: 22 April 2015 23:26
> To: Ken Dreyer
> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] systemd unit files and multiple daemons
> 
> On Wed, Apr 22, 2015 at 2:57 PM, Ken Dreyer  wrote:
> > I could really use some eyes on the systemd change proposed here:
> > http://tracker.ceph.com/issues/11344
> >
> > Specifically, on bullet #4 there, should we have a single
> > "ceph-mon.service" (implying that users should only run one monitor
> > daemon per server) or if we should support multiple "ceph-mon@" services
> > (implying that users will need to specify additional information when
> > starting the service(s)). The version in our tree is "ceph-mon@". James'
> > work for Ubuntu Vivid is only "ceph-mon" [2]. Same thing for ceph-mds vs
> > ceph-mds@.
> >
> > I'd prefer to keep Ubuntu downstream the same as Ceph upstream.
> >
> > What do we want to do for this?
> >
> > How common is it to run multiple monitor daemons or mds daemons on a
> > single host?
> 
> For a real deployment, you shouldn't be running multiple monitors on a
> single node in the general case. I'm not sure if we want to prohibit
> it by policy, but I'd be okay with the idea.
> For testing purposes (in ceph-qa-suite or using vstart as a developer)
> it's pretty common though, and we probably don't want to have to
> rewrite all our tests to change that. I'm not sure that vstart ever
> uses the regular init system, but teuthology/ceph-qa-suite obviously
> do!
> 
> For MDSes, it's probably appropriate/correct to support multiple
> daemons on the same host. This can be either a fault tolerance thing,
> or just a way of better using multiple cores if you're living on the
> (very dangerous) edge.
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance VS network usage

2015-04-23 Thread Nick Fisk
Hi Frederic,

 

If you are using EC pools, the primary OSD requests the remaining shards of
the object from the other OSD's, reassembles it and then sends the data to
the client. The entire object needs to be reconstructed even for a small IO
operation, so 4kb reads could lead to quite a large IO amplification if you
are using the default 4MB object sizes. I believe this is what you are
seeing, although creating a RBD with smaller object sizes can help reduce
this.
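
For example (a sketch; pool and image names are placeholders), an RBD created
with 1MB objects instead of the default 4MB:

rbd create --size 102400 --order 20 rbd/testimage   # order 20 = 1MB objects (22 = 4MB default)
rbd info rbd/testimage                              # confirms the object size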

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
SCHAER Frederic
Sent: 23 April 2015 15:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] read performance VS network usage

 

Hi again,

 

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs
drives). For these tests, I've setup a RAID0 on the 23 disks.

For now, I'm not using SSDs as I discovered my vendor apparently decreased
their perfs on purpose.

 

So : 5 server nodes of which 3 are MONS too.

I also have 5 clients.

All of them have a single 10G NIC,  I'm not using a private network.

I'm testing EC pools, with the failure domain set to hosts.

The EC pool k/m is set to k=4/m=1

I'm testing EC pools using the giant release
(ceph-0.87.1-0.el7.centos.x86_64)

 

And. I just found out I had "limited" read performance.

While I was watching the stats using dstat on one server node, I noticed
that during the rados (read) bench, all the server nodes sent about 370MiB/s
on the network, which is the average speed I get per server, but they also
all received about 750-800MiB/s on that same network. And 800MB/s is about
as much as you can get on a 10G link.

 

I'm trying to understand why I see this inbound data flow ?

-  Why does a server node receive data at all during a read bench ?

-  Why is it about twice as much as the data the node is sending ?

-  Is this about verifying data integrity at read time ?

 

I'm alone on the cluster, it's not used anywhere else.

I will try tomorrow to see if adding a 2nd 10G port (with a private network
this time) improves the performance, but I'm really curious here to
understand what's the bottleneck and what's ceph doing. ?

 

Looking at the write performance, I see the same kind of behavior : nodes
send about half the amount of data they receive (600MB/300MB), but this
might be because this time the client only sends the real data and the
erasure coding happens behind the scenes (or not ?)

 

Any idea ?

 

Regards

Frederic




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Swift and Ceph

2015-04-23 Thread alistair.whittle
All,

I was hoping for some advice.   I have recently built a Ceph cluster on RHEL 
6.5 and have configured RGW.  I want to test Swift API access, and as a 
result have created a user, swift subuser and swift keys as per the output 
below:


1.   Create user


radosgw-admin user create --uid="testuser1" --display-name="Test User1"
{ "user_id": "testuser1",
  "display_name": "Test User1",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [
{ "user": "testuser1",
  "access_key": "MJBEZLJ7BYG8XODXT71V",
  "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
  "swift_keys": [],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "user_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "temp_url_keys": []}


2.   Create subuser.

radosgw-admin subuser create --uid=testuser1 --subuser=testuser1:swift 
--access=full
{ "user_id": "testuser1",
  "display_name": "Test User1",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
{ "id": "testuser1:swift",
  "permissions": "full-control"}],
  "keys": [
{ "user": "testuser1:swift",
  "access_key": "HX9Q30EJWCZG825AT7B0",
  "secret_key": ""},
{ "user": "testuser1",
  "access_key": "MJBEZLJ7BYG8XODXT71V",
  "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
  "swift_keys": [],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "user_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "temp_url_keys": []}


3.   Create key

radosgw-admin key create --subuser=testuser1:swift --key-type=swift --gen-secret
{ "user_id": "testuser1",
  "display_name": "Test User1",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
{ "id": "testuser1:swift",
  "permissions": "full-control"}],
  "keys": [
{ "user": "testuser1:swift",
  "access_key": "HX9Q30EJWCZG825AT7B0",
  "secret_key": ""},
{ "user": "testuser1",
  "access_key": "MJBEZLJ7BYG8XODXT71V",
  "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
  "swift_keys": [
{ "user": "testuser1:swift",
  "secret_key": "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "user_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "temp_url_keys": []}

When I try and do anything using the credentials above, I get "Account not 
found" errors as per the example below:

swift -A https:///auth/1.0 -U testuser1:swift -K 
"KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf" list

That's the first thing.

Secondly, when I follow the process above to create a second user "testuser2", 
the user and subuser are created; however, when I try to generate a swift key 
for it, I get the following error:

radosgw-admin key create --subuser=testuser2:swift --key-type=swift --gen-secret
could not create key: unable to add access key, unable to store user info
2015-04-23 15:42:38.897090 7f38e157d820  0 WARNING: can't store user info, 
swift id () already mapped to another user (testuser2)

This suggests there is something wrong with the users or the configuration of 
the gateway somewhere.   Can someone provide some advice on what might be 
wrong, or where I can look to find out.   I have gone through whatever log 
files I can and don't see anything of any use at the moment.

Any help appreciated.

Thanks

Alistair

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD Crush question.

2015-04-23 Thread Robert LeBlanc
If you force CRUSH to put copies in each rack, then you will be limited by
the smallest rack. You can hit some severe limitations if you try to keep
your copies to two racks (see the thread titled "CRUSH rule for 3 replicas
across 2 hosts" for some of my explanation about this).

If I were you, I would install almost all the new hardware and hold out a
few pieces. Get the new hardware up and running, then take down some of the
original hardware and relocate it in the other cabinets so that you even
out the older lower capacity nodes and new higher capacity nodes in each
cabinet. That would give you the best of redundancy and performance (not
all PGs would have to have a replica on the potentially slower hardware).
This would allow you to have replication level three and able to lose a
rack.

Another option, if you have the racks, is to spread the new hardware over 3
racks instead of 2 so that your cluster spans 4 racks. CRUSH will give
preference to the newer hardware (assuming the CRUSH weights reflect the
size of the disks) and you would no longer be limited by the older, smaller
rack.
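
With the nodes spread evenly across the racks, the rule itself can stay simple.
Something like this sketch (the root bucket name "default" is an assumption):

rule replicated_racks {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}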

On Thu, Apr 23, 2015 at 3:20 AM, Rogier Dikkes 
wrote:

> Hello all,
>
> At this moment we have a scenario where i would like your opinion on.
>
> Scenario:
> Currently we have a ceph environment with 1 rack of hardware, this rack
> contains a couple of OSD nodes with 4T disks. In a few months time we will
> deploy 2 more racks with OSD nodes, these nodes have 6T disks and 1 node
> more per rack.
>
> Short overview:
> rack1: 4T OSD
> rack2: 6T OSD
> rack3: 6T OSD
>
> At this moment we are playing around with the idea to use the CRUSH map to
> make ceph 'rack aware' and ensure to have data replicated between racks.
> However, from the documentation I gathered, I found that when you enforce data
> replication between buckets, your max storage size will be limited by the
> smallest bucket. My understanding: if we enforce the objects (size=3) to be
> replicated to 3 racks, the moment the rack with 4T OSDs is full we cannot
> store data anymore.
>
> Is this assumption correct?
>
> The current idea we play with:
>
> - Create 2 rack buckets
> - Create a ruleset to create 2 object replica’s for the 2x 6T buckets
> - Create a ruleset to create 1 object replica over all the hosts.
>
> This would result in 3 replicas of the object. Where we are sure that 2
> objects at least are in different racks. In the unlikely event of a rack
> failure we would have at least 1 or 2 replica’s left.
>
> Our idea is to have a crush rule with config that looks like:
>
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
>
>
>   host r01-cn01 {
>   id -1
>   alg straw
>   hash 0
>   item osd.0 weight 4.00
>   }
>
>   host r01-cn02 {
>   id -2
>   alg straw
>   hash 0
>   item osd.1 weight 4.00
>   }
>
>   host r01-cn03 {
>   id -3
>   alg straw
>   hash 0
>   item osd.3 weight 4.00
>   }
>
>   host r02-cn04 {
>   id -4
>   alg straw
>   hash 0
>   item osd.4 weight 6.00
>   }
>
>   host r02-cn05 {
>   id -5
>   alg straw
>   hash 0
>   item osd.5 weight 6.00
>   }
>
>   host r02-cn06 {
>   id -6
>   alg straw
>   hash 0
>   item osd.6 weight 6.00
>   }
>
>   host r03-cn07 {
>   id -7
>   alg straw
>   hash 0
>   item osd.7 weight 6.00
>   }
>
>   host r03-cn08 {
>   id -8
>   alg straw
>   hash 0
>   item osd.8 weight 6.00
>   }
>
>   host r03-cn09 {
>   id -9
>   alg straw
>   hash 0
>   item osd.9 weight 6.00
>   }
>
>   rack r02 {
>   id -10
>   alg straw
>   hash 0
>   item r02-cn04 weight 6.00
>   item r02-cn05 weight 6.00
>   item r02-cn06 weight 6.00
>   }
>
>   rack r03 {
>   id -11
>   alg straw
>   hash 0
>   item r03-cn07 weight 6.00
>   item r03-cn08 weight 6.00
>   item r03-cn09 weight 6.00
>   }
>
>   root 6t {
>   id -12
>   alg straw
>   hash 0
>   item r02 weight 18.00
>   item r03 weight 18.00
>   }
>
>   rule one {
>   ruleset 1
>   type replicated
>   min_size 1
>   max_size 10
>   step take 6t
>   step chooseleaf firstn 2 type rack
>   step chooseleaf firstn 1 type host
>   step emit
>   }
>
> Is this the ri

Re: [ceph-users] strange benchmark problem : restarting osd daemon improve performance from 100k iops to 300k iops

2015-04-23 Thread Somnath Roy
Alexandre,
You can configure with --with-jemalloc or ./do_autogen -J to build ceph with 
jemalloc.
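
E.g., from a source checkout (a sketch; the jemalloc-devel package name and
availability are assumptions for CentOS/EPEL):

yum install -y jemalloc-devel
./autogen.sh
./configure --with-jemalloc
make -j$(nproc)

# or via the developer helper script:
./do_autogen.sh -J && make -j$(nproc)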

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Alexandre DERUMIER
Sent: Thursday, April 23, 2015 4:56 AM
To: Mark Nelson
Cc: ceph-users; ceph-devel; Milosz Tanski
Subject: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

>>If you have the means to compile the same version of ceph with 
>>jemalloc, I would be very interested to see how it does.

Yes, sure. (I have around 3-4 weeks to do all the benchs)

But I don't know how to do it ? 
I'm running the cluster on centos7.1, maybe it can be easy to patch the srpms 
to rebuild the package with jemalloc.



- Mail original -
De: "Mark Nelson" 
À: "aderumier" , "Srinivasula Maram" 

Cc: "ceph-users" , "ceph-devel" 
, "Milosz Tanski" 
Envoyé: Jeudi 23 Avril 2015 13:33:00
Objet: Re: [ceph-users] strange benchmark problem : restarting osd daemon 
improve performance from 100k iops to 300k iops

Thanks for the testing Alexandre! 

If you have the means to compile the same version of ceph with jemalloc, I 
would be very interested to see how it does. 

In some ways I'm glad it turned out not to be NUMA. I still suspect we will 
have to deal with it at some point, but perhaps not today. ;) 

Mark 

On 04/23/2015 05:58 AM, Alexandre DERUMIER wrote: 
> Maybe it's tcmalloc related
> I thinked to have patched it correctly, but perf show a lot of 
> tcmalloc::ThreadCache::ReleaseToCentralCache
> 
> before osd restart (100k)
> --
> 11.66% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::ThreadCache::ReleaseToCentralCache
> 8.51% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::FetchFromSpans
> 3.04% ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::ReleaseToSpans
> 2.04% ceph-osd libtcmalloc.so.4.1.2 [.] operator new 1.63% swapper 
> [kernel.kallsyms] [k] intel_idle 1.35% ceph-osd libtcmalloc.so.4.1.2 
> [.] tcmalloc::CentralFreeList::ReleaseListToSpans
> 1.33% ceph-osd libtcmalloc.so.4.1.2 [.] operator delete 1.07% ceph-osd 
> libstdc++.so.6.0.19 [.] std::basic_string std::char_traits, std::allocator >::basic_string 0.91% 
> ceph-osd libpthread-2.17.so [.] pthread_mutex_trylock 0.88% ceph-osd 
> libc-2.17.so [.] __memcpy_ssse3_back 0.81% ceph-osd ceph-osd [.] 
> Mutex::Lock 0.79% ceph-osd [kernel.kallsyms] [k] 
> copy_user_enhanced_fast_string 0.74% ceph-osd libpthread-2.17.so [.] 
> pthread_mutex_unlock 0.67% ceph-osd [kernel.kallsyms] [k] 
> _raw_spin_lock 0.63% swapper [kernel.kallsyms] [k] 
> native_write_msr_safe 0.62% ceph-osd [kernel.kallsyms] [k] 
> avc_has_perm_noaudit 0.58% ceph-osd ceph-osd [.] operator< 0.57% 
> ceph-osd [kernel.kallsyms] [k] __schedule 0.57% ceph-osd 
> [kernel.kallsyms] [k] __d_lookup_rcu 0.54% swapper [kernel.kallsyms] 
> [k] __schedule
> 
> 
> after osd restart (300k iops)
> --
> 3.47% ceph-osd libtcmalloc.so.4.1.2 [.] operator new 1.92% ceph-osd 
> libtcmalloc.so.4.1.2 [.] operator delete 1.86% swapper 
> [kernel.kallsyms] [k] intel_idle 1.52% ceph-osd libstdc++.so.6.0.19 
> [.] std::basic_string, 
> std::allocator >::basic_string 1.34% ceph-osd 
> libtcmalloc.so.4.1.2 [.] tcmalloc::ThreadCache::ReleaseToCentralCache
> 1.24% ceph-osd libc-2.17.so [.] __memcpy_ssse3_back 1.23% ceph-osd 
> ceph-osd [.] Mutex::Lock 1.21% ceph-osd libpthread-2.17.so [.] 
> pthread_mutex_trylock 1.11% ceph-osd [kernel.kallsyms] [k] 
> copy_user_enhanced_fast_string 0.95% ceph-osd libpthread-2.17.so [.] 
> pthread_mutex_unlock 0.94% ceph-osd [kernel.kallsyms] [k] 
> _raw_spin_lock 0.78% ceph-osd [kernel.kallsyms] [k] __d_lookup_rcu 
> 0.70% ceph-osd [kernel.kallsyms] [k] tcp_sendmsg 0.70% ceph-osd 
> ceph-osd [.] Message::Message 0.68% ceph-osd [kernel.kallsyms] [k] 
> __schedule 0.66% ceph-osd [kernel.kallsyms] [k] idle_cpu 0.65% 
> ceph-osd libtcmalloc.so.4.1.2 [.] 
> tcmalloc::CentralFreeList::FetchFromSpans
> 0.64% swapper [kernel.kallsyms] [k] native_write_msr_safe 0.61% 
> ceph-osd ceph-osd [.] 
> std::tr1::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
> 0.60% swapper [kernel.kallsyms] [k] __schedule 0.60% ceph-osd 
> libstdc++.so.6.0.19 [.] 0x000bdd2b 0.57% ceph-osd ceph-osd [.] 
> operator< 0.57% ceph-osd ceph-osd [.] crc32_iscsi_00 0.56% ceph-osd 
> libstdc++.so.6.0.19 [.] std::string::_Rep::_M_dispose 0.55% ceph-osd 
> [kernel.kallsyms] [k] __switch_to 0.54% ceph-osd libc-2.17.so [.] 
> vfprintf 0.52% ceph-osd [kernel.kallsyms] [k] fget_light
> 
> - Mail original -
> De: "aderumier" 
> À: "Srinivasula Maram" 
> Cc: "ceph-users" , "ceph-devel" 
> , "Milosz Tanski" 
> Envoyé: Jeudi 23 Avril 2015 10:00:34
> Objet: Re: [ceph-users] strange benchmark problem : restarting osd 
> daemon improve performance from 100k iops to 300k iops
> 
> Hi,
> I'm hitting this bug again today. 
> 
> So don't seem to be numa rel

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Wed, Apr 22, 2015 at 4:30 PM, Nick Fisk  wrote:
> I suspect you are hitting problems with sync writes, which Ceph isn't known
> for being the fastest thing for.

There's "not being the fastest thing" and "an expensive cluster of
hardware that performs worse than a single SATA drive." :-(

> I'm not a big expert on ZFS but I do know that a SSD ZIL is normally
> recommended to allow fast sync writes.

The VM already has an ZIL vdisk on an SSD on the KVM host.

You may be thinking of NFS traffic, which is where ZFS and sync writes
have such a long and difficult history.  As far as I know, zfs receive
operations do not do sync writes (except possibly at the beginning and
end to establish barriers) since the operation as a whole already has
fundamental transactionality built in.

> SSD Ceph journals may give you around 200-300 iops

SSD journals are not an option for this cluster.  We just need to get
the most out of what we have.

There are some fio results showing this problem isn't limited/specific
to ZFS which I will post in a separate message shortly.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Wed, Apr 22, 2015 at 4:07 PM, Somnath Roy  wrote:
> I am suggesting synthetic workload like fio to run on top of VM to identify 
> where the bottleneck is. For example, if fio is giving decent enough output, 
> I guess ceph layer is doing fine. It is your client that is not driving 
> enough.

After spending a day learning about fio and collecting some data, here
are some results from a new VM running ext4 on a 3TB image backed by
this cluster.  The default ZFS block size is 128K and the snapshot
receive operation is basically write-only so the random 128K write
test is probably the closest to our workload and, indeed, performs
very similarly: ~20M/sec throughput and 153 IOPs.  (Although really
all the random write tests are pretty bad.)  The script used is at the
bottom, and is adapted from one I found on this list.

4K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=2893: Thu Apr 23 07:04:13 2015
  read : io=83563MB, bw=47538KB/s, iops=11884, runt=188msec
clat (usec): min=775, max=617891, avg=5381.74, stdev=3221.32
 lat (usec): min=776, max=617892, avg=5382.21, stdev=3221.31
clat percentiles (usec):
 |  1.00th=[ 2480],  5.00th=[ 2960], 10.00th=[ 3184], 20.00th=[ 3472],
 | 30.00th=[ 3696], 40.00th=[ 3952], 50.00th=[ 4320], 60.00th=[ 5536],
 | 70.00th=[ 6432], 80.00th=[ 7072], 90.00th=[ 7968], 95.00th=[ 9664],
 | 99.00th=[14400], 99.50th=[17792], 99.90th=[36096], 99.95th=[48384],
 | 99.99th=[96768]
bw (KB  /s): min=6, max= 1266, per=1.56%, avg=743.41, stdev=167.14
lat (usec) : 1000=0.01%
lat (msec) : 2=0.15%, 4=41.68%, 10=53.77%, 20=4.05%, 50=0.31%
lat (msec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu  : usr=0.11%, sys=0.45%, ctx=24873086, majf=0, minf=1843
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=21392078/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=83563MB, aggrb=47537KB/s, minb=47537KB/s, maxb=47537KB/s,
mint=188msec, maxt=188msec

Disk stats (read/write):
  vdb: ios=21391171/11968, merge=0/333, ticks=103378628/812140,
in_queue=104445000, util=100.00%

128K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3065: Thu Apr 23 07:34:14 2015
  read : io=393209MB, bw=223681KB/s, iops=1747, runt=1800090msec
clat (msec): min=2, max=5388, avg=36.62, stdev=68.58
 lat (msec): min=2, max=5388, avg=36.62, stdev=68.58
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[4], 20.00th=[4],
 | 30.00th=[5], 40.00th=[   11], 50.00th=[   18], 60.00th=[   28],
 | 70.00th=[   41], 80.00th=[   59], 90.00th=[   89], 95.00th=[  121],
 | 99.00th=[  227], 99.50th=[  326], 99.90th=[  840], 99.95th=[ 1172],
 | 99.99th=[ 2073]
bw (KB  /s): min=   24, max=36207, per=1.65%, avg=3690.81, stdev=3784.08
lat (msec) : 4=24.76%, 10=14.89%, 20=12.70%, 50=23.77%, 100=16.04%
lat (msec) : 250=6.99%, 500=0.58%, 750=0.13%, 1000=0.05%, 2000=0.06%
lat (msec) : >=2000=0.01%
  cpu  : usr=0.03%, sys=0.14%, ctx=3829128, majf=0, minf=3910
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=3145673/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=393209MB, aggrb=223681KB/s, minb=223681KB/s,
maxb=223681KB/s, mint=1800090msec, maxt=1800090msec

Disk stats (read/write):
  vdb: ios=3145617/6096, merge=0/170, ticks=95140524/632704,
in_queue=95776824, util=100.00%

8M sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3239: Thu Apr 23 08:04:16 2015
  read : io=393920MB, bw=223916KB/s, iops=27, runt=1801449msec
clat (msec): min=65, max=23189, avg=2340.68, stdev=1351.01
 lat (msec): min=65, max=23189, avg=2340.68, stdev=1351.01
clat percentiles (msec):
 |  1.00th=[  204],  5.00th=[  474], 10.00th=[  791], 20.00th=[ 1221],
 | 30.00th=[ 1532], 40.00th=[ 1844], 50.00th=[ 2147], 60.00th=[ 2474],
 | 70.00th=[ 2868], 80.00th=[ 3359], 90.00th=[ 4080], 95.00th=[ 4817],
 | 99.00th=[ 6456], 99.50th=[ 7177], 99.90th=[ 9110], 99.95th=[10159],
 | 99.99th=[12518]
bw (KB  /s): min=  353, max=52012, per=2.09%, avg=4674.81, stdev=3150.53
lat (msec) : 100=0.09%, 250=1.56%, 500=3.76%, 750=3.90%, 1000=4.99%
lat (msec) : 2000=30.78%, >=2000=54.92%
  cpu  : usr=0.00%, sys=0.23%, ctx=41395542, majf=0, minf=34821
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued 

Re: [ceph-users] cluster not coming up after reboot

2015-04-23 Thread Craig Lewis
On Thu, Apr 23, 2015 at 5:20 AM, Kenneth Waegeman
>
> So it is all fixed now, but is it explainable that at first about 90% of
> the OSDS going into shutdown over and over, and only after some time got in
> a stable situation, because of one host network failure ?
>
> Thanks again!


Yes, unless you've adjusted:
[global]
  mon osd min down reporters = 9
  mon osd min down reports = 12

OSDs talk to the MONs on the public network.  The cluster network is only
used for OSD to OSD communication.

If one OSD node can't talk on that network, the other nodes will tell the
MONs that it's OSDs are down.  And that node will also tell the MONs that
all the other OSDs are down.  Then the OSDs marked down will tell the MONs
that they're not down, and the cycle will repeat.

I'm somewhat surprised that your cluster eventually stabilized.


I have 8 OSDs per node.  I set my min down reporters high enough that no
single node can mark another node's OSDs down.
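
If you want to apply that without restarting the monitors, something like this
should work (a sketch; run it against each mon and keep ceph.conf in sync so
the change survives restarts):

ceph tell mon.a injectargs '--mon-osd-min-down-reporters 9 --mon-osd-min-down-reports 12'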
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Swift and Ceph

2015-04-23 Thread Yehuda Sadeh-Weinraub
Sounds like you're hitting a known issue that was fixed a while back (although 
might not be fixed on the specific version you're running). Can you try 
creating a second subuser for the same user, see if that one works?
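
I.e. something along these lines (subuser name and gateway URL are placeholders):

radosgw-admin subuser create --uid=testuser1 --subuser=testuser1:swift2 --access=full
radosgw-admin key create --subuser=testuser1:swift2 --key-type=swift --gen-secret
swift -A https://<gateway>/auth/1.0 -U testuser1:swift2 -K <generated_secret> list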

Yehuda

- Original Message -
> From: "alistair whittle" 
> To: ceph-users@lists.ceph.com
> Sent: Thursday, April 23, 2015 8:38:44 AM
> Subject: [ceph-users] Swift and Ceph
> 
> 
> 
> All,
> 
> 
> 
> I was hoping for some advice. I have recently built a Ceph cluster on RHEL
> 6.5 and have configured RGW. I want to test Swift API access, and as a
> result have created a user, swift subuser and swift keys as per the output
> below:
> 
> 
> 
> 1. Create user
> 
> 
> 
> radosgw-admin user create --uid="testuser1" --display-name="Test User1"
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [],
> 
> "keys": [
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> 2. Create subuser.
> 
> 
> 
> radosgw-admin subuser create --uid=testuser1 --subuser=testuser1:swift
> --access=full
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [
> 
> { "id": "testuser1:swift",
> 
> "permissions": "full-control"}],
> 
> "keys": [
> 
> { "user": "testuser1:swift",
> 
> "access_key": "HX9Q30EJWCZG825AT7B0",
> 
> "secret_key": ""},
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> 3. Create key
> 
> 
> 
> radosgw-admin key create --subuser=testuser1:swift --key-type=swift
> --gen-secret
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [
> 
> { "id": "testuser1:swift",
> 
> "permissions": "full-control"}],
> 
> "keys": [
> 
> { "user": "testuser1:swift",
> 
> "access_key": "HX9Q30EJWCZG825AT7B0",
> 
> "secret_key": ""},
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [
> 
> { "user": "testuser1:swift",
> 
> "secret_key": "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf"}],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> When I try and do anything using the credentials above, I get “Account not
> found” errors as per the example below:
> 
> 
> 
> swift -A https:///auth/1.0 -U testuser1:swift -K
> "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf" list
> 
> 
> 
> That’s the first thing.
> 
> 
> 
> Secondly, when I follow the process above to create a second user
> “testuser2”, the user and subuser is created, however, when I try and
> generate a swift key for it, I get the following error:
> 
> 
> 
> radosgw-admin key create --subuser=testuser2:swift --key-type=swift
> --gen-secret
> 
> could not create key: unable to add access key, unable to store user info
> 
> 2015-04-23 15:42:38.897090 7f38e157d820 0 WARNING: can't store user info,
> swift id () already mapped to another user (testuser2)
> 
> 
> 
> This suggests there is something wrong with the users or the configuration of
> the gateway somewhere. Can someone provide some advice on what might be
> wrong, or where I can look to find out. I have gone through whatever log
> files I can and don’t see anything of any use at the moment.
> 
> 
> 
> Any help appreciated.
> 
> 
> 
> Thanks
> 
> 
> 
> Alistair
> 
> 
> ___
> 
> This message is for information purposes only, it is not a recommendation,
> advice, offer or solicitation to buy or sell a product or service nor an
> official confirmation of any transaction. It is directed at persons who are
> professionals and is not intended for retail customer use. Intended for
> recipient only. This message is subject to

Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-23 Thread Gregory Farnum
On Thu, Apr 23, 2015 at 12:55 AM, Steffen W Sørensen  wrote:
>> But in the menu, the use case "cephfs only" doesn't exist and I have
>> no idea of the %data for each pools metadata and data. So, what is
>> the proportion (approximatively) of %data between the "data" pool and
>> the "metadata" pool of cephfs in a cephfs-only cluster?
>>
>> Is it rather metadata=20%, data=80%?
>> Is it rather metadata=10%, data=90%?
>> Is it rather metadata= 5%, data=95%?
>> etc.
> Mileage will vary here, depending on the ratio between the number of entries 
> in your Ceph FS and their sizes, e.g. many small files vs. few large ones.
> So you are probably the best one to estimate this yourself :)


Yeah. The metadata pool will contain:
1) MDS logs, which I think by default will take up to 200MB per
logical MDS. (You should have only one logical MDS.)
2) directory metadata objects, which contain the dentries and inodes
of the system; ~4KB is probably generous for each?
3) Some smaller data structures about the allocated inode range and
current client sessions.

The data pool contains all of the file data. Presumably this is much
larger, but it will depend on your average file size and we've not
done any real study of it.
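
If you want to measure the split on a live cluster instead of estimating it,
the per-pool usage output is enough (pool names below assume the default
"metadata" and "data" pools -- substitute your own):

  ceph df detail   # per-pool USED and OBJECTS
  rados df         # per-pool object counts and KB used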
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "Compacting" btrfs file storage

2015-04-23 Thread Gregory Farnum
On Thu, Apr 23, 2015 at 1:25 AM, Burkhard Linke
 wrote:
> Hi,
>
> I've noticed that the btrfs file storage is performing some
> cleanup/compacting operations during OSD startup.
>
> Before OSD start:
> /dev/sdc1  2.8T  2.4T  390G  87% /var/lib/ceph/osd/ceph-58
>
> After OSD start:
> /dev/sdc1  2.8T  2.2T  629G  78% /var/lib/ceph/osd/ceph-58
>
> OSDs are configured with firefly default settings.
>
> This "compacting" of the underlying storage happens during the PG loading
> phase of the OSD start.
>
> Is it possible to trigger this compacting without restarting the OSD?

This looks to me less like btrfs and more like the OSD
"compact_on_mount" style options. Those default to false right now,
but maybe it's set to true in your version — you should explore the
archives for threads about that. I'm not sure if there's a way to do
it online, but at a quick grep it looks to me like there isn't.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing a ceph fs

2015-04-23 Thread Gregory Farnum
I think you have to "ceph mds fail" the last one up, then you'll be
able to remove it.
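
Roughly this, as a sketch (the rank/name arguments and the confirmation flag may
differ slightly on your version, so check "ceph --help" first):

  ceph mds fail 0          # or: ceph mds fail mds03
  ceph fs rm ceph_fs --yes-i-really-mean-it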
-Greg

On Thu, Apr 23, 2015 at 7:52 AM, Kenneth Waegeman
 wrote:
>
>
> On 04/22/2015 06:51 PM, Gregory Farnum wrote:
>>
>> If you look at the "ceph --help" output you'll find some commands for
>> removing MDSes from the system.
>
>
> Yes, this works for all but the last mds..
>
> [root@mds01 ~]# ceph mds rm 35632 mds.mds03
> Error EBUSY: cannot remove active mds.mds03 rank 0
>
> I stopped the daemon, checked the process was stopped, even did a shutdown
> of that mds server, I keep getting this message and am unable to remove the
> fs ..
>
> log file has this:
>
> 2015-04-23 16:14:05.171450 7fa9fe799700 -1 mds.0.4 *** got signal Terminated
> ***
> 2015-04-23 16:14:05.171490 7fa9fe799700  1 mds.0.4 suicide.  wanted
> down:dne, now up:active
>
>
>
>> -Greg
>> On Wed, Apr 22, 2015 at 6:46 AM Kenneth Waegeman
>> mailto:kenneth.waege...@ugent.be>> wrote:
>>
>> forgot to mention I'm running 0.94.1
>>
>> On 04/22/2015 03:02 PM, Kenneth Waegeman wrote:
>>  > Hi,
>>  >
>>  > I tried to recreate a ceph fs ( well actually an underlying pool,
>> but
>>  > for that I need to first remove the fs) , but this seems not that
>> easy
>>  > to achieve.
>>  >
>>  > When I run
>>  > `ceph fs rm ceph_fs`
>>  > I get:
>>  > `Error EINVAL: all MDS daemons must be inactive before removing
>> filesystem`
>>  >
>>  > I stopped the 3 MDSs, but this doesn't change anything, as ceph
>> health
>>  > still "thinks" there is an mds running laggy:
>>  >
>>  >   health HEALTH_WARN
>>  >  mds cluster is degraded
>>  >  mds mds03 is laggy
>>  >   monmap e1: 3 mons at ...
>>  >  election epoch 12, quorum 0,1,2 mds01,mds02,mds03
>>  >   mdsmap e12: 1/1/1 up {0=mds03=up:replay(laggy or crashed)}
>>  >
>>  > I checked the mds processes are gone..
>>  >
>>  > Someone knows a solution for this?
>>  >
>>  > Thanks!
>>  > Kenneth
>>  > ___
>>  > ceph-users mailing list
>>  > ceph-users@lists.ceph.com 
>>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Swift and Ceph

2015-04-23 Thread alistair.whittle
Can you explain this a bit more?   You mean try and create a second subuser for 
testuser1 or testuser2?

As an aside, I am running Ceph 0.80.7 as is packaged with ICE 1.2.2.  I believe 
that is the Firefly release.


-Original Message-
From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com] 
Sent: Thursday, April 23, 2015 6:18 PM
To: Whittle, Alistair: Investment Bank (LDN)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Swift and Ceph

Sounds like you're hitting a known issue that was fixed a while back (although 
might not be fixed on the specific version you're running). Can you try 
creating a second subuser for the same user, see if that one works?

Yehuda

- Original Message -
> From: "alistair whittle" 
> To: ceph-users@lists.ceph.com
> Sent: Thursday, April 23, 2015 8:38:44 AM
> Subject: [ceph-users] Swift and Ceph
> 
> 
> 
> All,
> 
> 
> 
> I was hoping for some advice. I have recently built a Ceph cluster on 
> RHEL
> 6.5 and have configured RGW. I want to test Swift API access, and as a 
> result have created a user, swift subuser and swift keys as per the 
> output
> below:
> 
> 
> 
> 1. Create user
> 
> 
> 
> radosgw-admin user create --uid="testuser1" --display-name="Test User1"
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [],
> 
> "keys": [
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> 2. Create subuser.
> 
> 
> 
> radosgw-admin subuser create --uid=testuser1 --subuser=testuser1:swift 
> --access=full
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [
> 
> { "id": "testuser1:swift",
> 
> "permissions": "full-control"}],
> 
> "keys": [
> 
> { "user": "testuser1:swift",
> 
> "access_key": "HX9Q30EJWCZG825AT7B0",
> 
> "secret_key": ""},
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> 3. Create key
> 
> 
> 
> radosgw-admin key create --subuser=testuser1:swift --key-type=swift 
> --gen-secret
> 
> { "user_id": "testuser1",
> 
> "display_name": "Test User1",
> 
> "email": "",
> 
> "suspended": 0,
> 
> "max_buckets": 1000,
> 
> "auid": 0,
> 
> "subusers": [
> 
> { "id": "testuser1:swift",
> 
> "permissions": "full-control"}],
> 
> "keys": [
> 
> { "user": "testuser1:swift",
> 
> "access_key": "HX9Q30EJWCZG825AT7B0",
> 
> "secret_key": ""},
> 
> { "user": "testuser1",
> 
> "access_key": "MJBEZLJ7BYG8XODXT71V",
> 
> "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> 
> "swift_keys": [
> 
> { "user": "testuser1:swift",
> 
> "secret_key": "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf"}],
> 
> "caps": [],
> 
> "op_mask": "read, write, delete",
> 
> "default_placement": "",
> 
> "placement_tags": [],
> 
> "bucket_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "user_quota": { "enabled": false,
> 
> "max_size_kb": -1,
> 
> "max_objects": -1},
> 
> "temp_url_keys": []}
> 
> 
> 
> When I try and do anything using the credentials above, I get “Account 
> not found” errors as per the example below:
> 
> 
> 
> swift -A https:///auth/1.0 -U testuser1:swift -K 
> "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf" list
> 
> 
> 
> That’s the first thing.
> 
> 
> 
> Secondly, when I follow the process above to create a second user 
> “testuser2”, the user and subuser is created, however, when I try and 
> generate a swift key for it, I get the following error:
> 
> 
> 
> radosgw-admin key create --subuser=testuser2:swift --key-type=swift 
> --gen-secret
> 
> could not create key: unable to add access key, unable to store user 
> info
> 
> 2015-04-23 15:42:38.897090 7f38e157d820 0 WARNING: can't store user 
> info, swift id () already mapped to another user (testuser2)
> 
> 
> 
> This suggests there is something wrong with the users or the 
> configuration of the gateway somewhere. Can someone provide some 
> advice on what might be wrong, or where I can look to find out. I have 
> gone through whatever log files I can and don’t see anything of any use at 
> the momen

Re: [ceph-users] Swift and Ceph

2015-04-23 Thread Yehuda Sadeh-Weinraub
I think you're hitting issue #8587 (http://tracker.ceph.com/issues/8587). This 
issue has been fixed at 0.80.8, so you might want to upgrade to that version 
(available with ICE 1.2.3).

Yehuda

- Original Message -
> From: "alistair whittle" 
> To: yeh...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, April 23, 2015 10:47:28 AM
> Subject: Re: [ceph-users] Swift and Ceph
> 
> Can you explain this a bit more?   You mean try and create a second subuser
> for testuser1 or testuser2?
> 
> As an aside, I am running Ceph 0.80.7 as is packaged with ICE 1.2.2.  I
> believe that is the Firefly release.
> 
> 
> -Original Message-
> From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com]
> Sent: Thursday, April 23, 2015 6:18 PM
> To: Whittle, Alistair: Investment Bank (LDN)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Swift and Ceph
> 
> Sounds like you're hitting a known issue that was fixed a while back
> (although might not be fixed on the specific version you're running). Can
> you try creating a second subuser for the same user, see if that one works?
> 
> Yehuda
> 
> - Original Message -
> > From: "alistair whittle" 
> > To: ceph-users@lists.ceph.com
> > Sent: Thursday, April 23, 2015 8:38:44 AM
> > Subject: [ceph-users] Swift and Ceph
> > 
> > 
> > 
> > All,
> > 
> > 
> > 
> > I was hoping for some advice. I have recently built a Ceph cluster on
> > RHEL
> > 6.5 and have configured RGW. I want to test Swift API access, and as a
> > result have created a user, swift subuser and swift keys as per the
> > output
> > below:
> > 
> > 
> > 
> > 1. Create user
> > 
> > 
> > 
> > radosgw-admin user create --uid="testuser1" --display-name="Test User1"
> > 
> > { "user_id": "testuser1",
> > 
> > "display_name": "Test User1",
> > 
> > "email": "",
> > 
> > "suspended": 0,
> > 
> > "max_buckets": 1000,
> > 
> > "auid": 0,
> > 
> > "subusers": [],
> > 
> > "keys": [
> > 
> > { "user": "testuser1",
> > 
> > "access_key": "MJBEZLJ7BYG8XODXT71V",
> > 
> > "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> > 
> > "swift_keys": [],
> > 
> > "caps": [],
> > 
> > "op_mask": "read, write, delete",
> > 
> > "default_placement": "",
> > 
> > "placement_tags": [],
> > 
> > "bucket_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "user_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "temp_url_keys": []}
> > 
> > 
> > 
> > 2. Create subuser.
> > 
> > 
> > 
> > radosgw-admin subuser create --uid=testuser1 --subuser=testuser1:swift
> > --access=full
> > 
> > { "user_id": "testuser1",
> > 
> > "display_name": "Test User1",
> > 
> > "email": "",
> > 
> > "suspended": 0,
> > 
> > "max_buckets": 1000,
> > 
> > "auid": 0,
> > 
> > "subusers": [
> > 
> > { "id": "testuser1:swift",
> > 
> > "permissions": "full-control"}],
> > 
> > "keys": [
> > 
> > { "user": "testuser1:swift",
> > 
> > "access_key": "HX9Q30EJWCZG825AT7B0",
> > 
> > "secret_key": ""},
> > 
> > { "user": "testuser1",
> > 
> > "access_key": "MJBEZLJ7BYG8XODXT71V",
> > 
> > "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> > 
> > "swift_keys": [],
> > 
> > "caps": [],
> > 
> > "op_mask": "read, write, delete",
> > 
> > "default_placement": "",
> > 
> > "placement_tags": [],
> > 
> > "bucket_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "user_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "temp_url_keys": []}
> > 
> > 
> > 
> > 3. Create key
> > 
> > 
> > 
> > radosgw-admin key create --subuser=testuser1:swift --key-type=swift
> > --gen-secret
> > 
> > { "user_id": "testuser1",
> > 
> > "display_name": "Test User1",
> > 
> > "email": "",
> > 
> > "suspended": 0,
> > 
> > "max_buckets": 1000,
> > 
> > "auid": 0,
> > 
> > "subusers": [
> > 
> > { "id": "testuser1:swift",
> > 
> > "permissions": "full-control"}],
> > 
> > "keys": [
> > 
> > { "user": "testuser1:swift",
> > 
> > "access_key": "HX9Q30EJWCZG825AT7B0",
> > 
> > "secret_key": ""},
> > 
> > { "user": "testuser1",
> > 
> > "access_key": "MJBEZLJ7BYG8XODXT71V",
> > 
> > "secret_key": "tGnsm8JeEgPGAy1MGCKSVVoSIEs8iWNUOgiJ981p"}],
> > 
> > "swift_keys": [
> > 
> > { "user": "testuser1:swift",
> > 
> > "secret_key": "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf"}],
> > 
> > "caps": [],
> > 
> > "op_mask": "read, write, delete",
> > 
> > "default_placement": "",
> > 
> > "placement_tags": [],
> > 
> > "bucket_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "user_quota": { "enabled": false,
> > 
> > "max_size_kb": -1,
> > 
> > "max_objects": -1},
> > 
> > "temp_url_keys": []}
> > 
> > 
> > 
> > When I try and do anything using the credentials above, I get “Account
> > not found” errors as per the example below:
> > 
> > 
> > 
> > swift -A https:///auth/1.0 -U testuser1:swift -K
> > "KpQCfPLstJhSMsR9qUzY9WfA1ebO4x7VRXkr1KSf" list
> >

[ceph-users] Serving multiple applications with a single cluster

2015-04-23 Thread Rafael Coninck Teigão

Hello everyone.

I'm new to the list and also just a beginner at using Ceph, and I'd like 
to get some advice from you on how to create the right infrastructure 
for our scenario.


We'd like to provide storage to three different applications, but each 
should have its own "area". Also, ideally we'd like to avoid using RGW, 
so that we can deploy the new storage without changing the applications 
too much.


Is it possible to accomplish this with a single cluster? I know I won't 
be able to have multiple CephFS with decent isolation 
(https://wiki.ceph.com/Planning/Sideboard/Client_Security_for_CephFS) 
and that running multiple clusters on the same hardware involves 
changing all the TCP ports for each instance.


I guess the perfect solution for us would be able to create different 
pools and serve them in different CephFS configurations, but that's not 
possible as of now right?


How would you go in configuring Ceph for this scenario?

Thanks,
Rafael.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Somnath Roy
David,
With a similar 128K profile I am getting ~200MB/s bandwidth with the entire OSD 
on SSD. I never tested with HDDs, but it seems you are reaching Ceph's limit 
on this. Probably nothing wrong with your setup!

Thanks & Regards
Somnath

-Original Message-
From: jdavidli...@gmail.com [mailto:jdavidli...@gmail.com] On Behalf Of J David
Sent: Thursday, April 23, 2015 9:56 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Wed, Apr 22, 2015 at 4:07 PM, Somnath Roy  wrote:
> I am suggesting synthetic workload like fio to run on top of VM to identify 
> where the bottleneck is. For example, if fio is giving decent enough output, 
> I guess ceph layer is doing fine. It is your client that is not driving 
> enough.

After spending a day learning about fio and collecting some data, here are some 
results from a new VM running ext4 on a 3TB image backed by this cluster.  The 
default ZFS block size is 128K and the snapshot receive operation is basically 
write-only so the random 128K write test is probably the closest to our 
workload and, indeed, performs very similarly: ~20M/sec throughput and 153 
IOPs.  (Although really all the random write tests are pretty bad.)  The script 
used is at the bottom, and is adapted from one I found on this list.

4K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=2893: Thu Apr 23 07:04:13 2015
  read : io=83563MB, bw=47538KB/s, iops=11884, runt=188msec
clat (usec): min=775, max=617891, avg=5381.74, stdev=3221.32
 lat (usec): min=776, max=617892, avg=5382.21, stdev=3221.31
clat percentiles (usec):
 |  1.00th=[ 2480],  5.00th=[ 2960], 10.00th=[ 3184], 20.00th=[ 3472],
 | 30.00th=[ 3696], 40.00th=[ 3952], 50.00th=[ 4320], 60.00th=[ 5536],
 | 70.00th=[ 6432], 80.00th=[ 7072], 90.00th=[ 7968], 95.00th=[ 9664],
 | 99.00th=[14400], 99.50th=[17792], 99.90th=[36096], 99.95th=[48384],
 | 99.99th=[96768]
bw (KB  /s): min=6, max= 1266, per=1.56%, avg=743.41, stdev=167.14
lat (usec) : 1000=0.01%
lat (msec) : 2=0.15%, 4=41.68%, 10=53.77%, 20=4.05%, 50=0.31%
lat (msec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu  : usr=0.11%, sys=0.45%, ctx=24873086, majf=0, minf=1843
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=21392078/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=83563MB, aggrb=47537KB/s, minb=47537KB/s, maxb=47537KB/s, 
mint=188msec, maxt=188msec

Disk stats (read/write):
  vdb: ios=21391171/11968, merge=0/333, ticks=103378628/812140, 
in_queue=104445000, util=100.00%

128K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3065: Thu Apr 23 07:34:14 2015
  read : io=393209MB, bw=223681KB/s, iops=1747, runt=1800090msec
clat (msec): min=2, max=5388, avg=36.62, stdev=68.58
 lat (msec): min=2, max=5388, avg=36.62, stdev=68.58
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[4], 20.00th=[4],
 | 30.00th=[5], 40.00th=[   11], 50.00th=[   18], 60.00th=[   28],
 | 70.00th=[   41], 80.00th=[   59], 90.00th=[   89], 95.00th=[  121],
 | 99.00th=[  227], 99.50th=[  326], 99.90th=[  840], 99.95th=[ 1172],
 | 99.99th=[ 2073]
bw (KB  /s): min=   24, max=36207, per=1.65%, avg=3690.81, stdev=3784.08
lat (msec) : 4=24.76%, 10=14.89%, 20=12.70%, 50=23.77%, 100=16.04%
lat (msec) : 250=6.99%, 500=0.58%, 750=0.13%, 1000=0.05%, 2000=0.06%
lat (msec) : >=2000=0.01%
  cpu  : usr=0.03%, sys=0.14%, ctx=3829128, majf=0, minf=3910
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=3145673/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=393209MB, aggrb=223681KB/s, minb=223681KB/s, maxb=223681KB/s, 
mint=1800090msec, maxt=1800090msec

Disk stats (read/write):
  vdb: ios=3145617/6096, merge=0/170, ticks=95140524/632704, in_queue=95776824, 
util=100.00%

8M sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3239: Thu Apr 23 08:04:16 2015
  read : io=393920MB, bw=223916KB/s, iops=27, runt=1801449msec
clat (msec): min=65, max=23189, avg=2340.68, stdev=1351.01
 lat (msec): min=65, max=23189, avg=2340.68, stdev=1351.01
clat percentiles (msec):
 |  1.00th=[  204],  5.00th=[  474], 10.00th=[  791], 20.00th=[ 1221],
 | 30.00th=[ 1532], 40.00th=[ 1844], 50.00th=[ 2147], 60.00th=[ 2474],
 | 70.00th=[ 2868], 80.00th=[ 3359], 90.00th=[ 4080], 95.00th=[ 4817],
 | 99.00th=[ 6456], 99.50th=[ 7177], 99.90th=[ 9110], 99.95th=[10159],
 | 99.99th=[12518]
bw

Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> J David
> Sent: 23 April 2015 17:51
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Having trouble getting good performance
> 
> On Wed, Apr 22, 2015 at 4:30 PM, Nick Fisk  wrote:
> > I suspect you are hitting problems with sync writes, which Ceph isn't
> > known for being the fastest thing for.
> 
> There's "not being the fastest thing" and "an expensive cluster of
hardware
> that performs worse than a single SATA drive." :-(
> 

If you are doing single threaded writes (not saying you are, but...) you
will get worse performance than a SATA drive as a write to each OSD will
actually involve 2-3x the number of source IO's, which is why SSD journals
help so much. A single threaded operation without SSD journals will probably
max out around 40-50 iops, this will then scale roughly in a linear fashion
until you hit iodepth=#disks/#replicas, where performance will then start to
tail off as latency increases.
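
As a rough worked example (all numbers assumed, not taken from your cluster): with
24 spinning OSDs, 3x replication and co-located journals, that knee would sit
around iodepth = 24 / 3 = 8, so you would expect roughly 40-50 IOPS x 8, i.e.
somewhere in the 300-400 IOPS region for sync writes before latency starts to climb.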

If you can let us know the avg queue depth that ZFS is generating that will
probably give a good estimation of what you can expect from the cluster.

I'm using Ceph with ESXi over iSCSI which is a shocker for sync writes. A
single thread gets awful performance, however when hundreds of things start
happening at once, Ceph smiles and IOPs go into the thousands.

> > I'm not a big expert on ZFS but I do know that a SSD ZIL is normally
> > recommended to allow fast sync writes.
> 
> The VM already has an ZIL vdisk on an SSD on the KVM host.
> 
> You may be thinking of NFS traffic, which is where ZFS and sync writes
have
> such a long and difficult history.  As far as I know, zfs receive
operations do
> not do sync writes (except possibly at the beginning and end to establish
> barriers) since the operation as a whole already has fundamental
> transactionality built in.
> 
> > SSD Ceph journals may give you around 200-300 iops
> 
> SSD journals are not an option for this cluster.  We just need to get the
most
> out of what we have.
> 
> There are some fio results showing this problem isn't limited/specific to
ZFS
> which I will post in a separate message shortly.
> 

I have had a look through the fio runs, could you also try and run a couple
of jobs with iodepth=64 instead of numjobs=64. I know they should do the
same thing, but the numbers with the former are easier to understand. 

It may also be worth getting hold of the RBD enabled fio so that you can
test performance outside of the VM.
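
Something along these lines, assuming your fio build has the rbd engine compiled
in and you point it at a throwaway test image (pool/image/client names here are
placeholders):

  [rbd-randwrite]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio-test
  rw=randwrite
  bs=128k
  iodepth=64
  direct=1
  runtime=300
  time_based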

> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Somnath Roy
BTW, I am not writing any replicas with the numbers below.
Performance will degrade further depending on the replica count. How many replicas are 
you writing?

Thanks & Regards
Somnath

-Original Message-
From: Somnath Roy
Sent: Thursday, April 23, 2015 12:04 PM
To: 'J David'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Having trouble getting good performance

David,
With a similar 128K profile I am getting ~200MB/s bandwidth with the entire OSD 
on SSD. I never tested with HDDs, but it seems you are reaching Ceph's limit 
on this. Probably nothing wrong with your setup!

Thanks & Regards
Somnath

-Original Message-
From: jdavidli...@gmail.com [mailto:jdavidli...@gmail.com] On Behalf Of J David
Sent: Thursday, April 23, 2015 9:56 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Wed, Apr 22, 2015 at 4:07 PM, Somnath Roy  wrote:
> I am suggesting synthetic workload like fio to run on top of VM to identify 
> where the bottleneck is. For example, if fio is giving decent enough output, 
> I guess ceph layer is doing fine. It is your client that is not driving 
> enough.

After spending a day learning about fio and collecting some data, here are some 
results from a new VM running ext4 on a 3TB image backed by this cluster.  The 
default ZFS block size is 128K and the snapshot receive operation is basically 
write-only so the random 128K write test is probably the closest to our 
workload and, indeed, performs very similarly: ~20M/sec throughput and 153 
IOPs.  (Although really all the random write tests are pretty bad.)  The script 
used is at the bottom, and is adapted from one I found on this list.

4K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=2893: Thu Apr 23 07:04:13 2015
  read : io=83563MB, bw=47538KB/s, iops=11884, runt=188msec
clat (usec): min=775, max=617891, avg=5381.74, stdev=3221.32
 lat (usec): min=776, max=617892, avg=5382.21, stdev=3221.31
clat percentiles (usec):
 |  1.00th=[ 2480],  5.00th=[ 2960], 10.00th=[ 3184], 20.00th=[ 3472],
 | 30.00th=[ 3696], 40.00th=[ 3952], 50.00th=[ 4320], 60.00th=[ 5536],
 | 70.00th=[ 6432], 80.00th=[ 7072], 90.00th=[ 7968], 95.00th=[ 9664],
 | 99.00th=[14400], 99.50th=[17792], 99.90th=[36096], 99.95th=[48384],
 | 99.99th=[96768]
bw (KB  /s): min=6, max= 1266, per=1.56%, avg=743.41, stdev=167.14
lat (usec) : 1000=0.01%
lat (msec) : 2=0.15%, 4=41.68%, 10=53.77%, 20=4.05%, 50=0.31%
lat (msec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu  : usr=0.11%, sys=0.45%, ctx=24873086, majf=0, minf=1843
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=21392078/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=83563MB, aggrb=47537KB/s, minb=47537KB/s, maxb=47537KB/s, 
mint=188msec, maxt=188msec

Disk stats (read/write):
  vdb: ios=21391171/11968, merge=0/333, ticks=103378628/812140, 
in_queue=104445000, util=100.00%

128K sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3065: Thu Apr 23 07:34:14 2015
  read : io=393209MB, bw=223681KB/s, iops=1747, runt=1800090msec
clat (msec): min=2, max=5388, avg=36.62, stdev=68.58
 lat (msec): min=2, max=5388, avg=36.62, stdev=68.58
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[4], 20.00th=[4],
 | 30.00th=[5], 40.00th=[   11], 50.00th=[   18], 60.00th=[   28],
 | 70.00th=[   41], 80.00th=[   59], 90.00th=[   89], 95.00th=[  121],
 | 99.00th=[  227], 99.50th=[  326], 99.90th=[  840], 99.95th=[ 1172],
 | 99.99th=[ 2073]
bw (KB  /s): min=   24, max=36207, per=1.65%, avg=3690.81, stdev=3784.08
lat (msec) : 4=24.76%, 10=14.89%, 20=12.70%, 50=23.77%, 100=16.04%
lat (msec) : 250=6.99%, 500=0.58%, 750=0.13%, 1000=0.05%, 2000=0.06%
lat (msec) : >=2000=0.01%
  cpu  : usr=0.03%, sys=0.14%, ctx=3829128, majf=0, minf=3910
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=3145673/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=393209MB, aggrb=223681KB/s, minb=223681KB/s, maxb=223681KB/s, 
mint=1800090msec, maxt=1800090msec

Disk stats (read/write):
  vdb: ios=3145617/6096, merge=0/170, ticks=95140524/632704, in_queue=95776824, 
util=100.00%

8M sequential read:

testfile: (groupid=0, jobs=64): err= 0: pid=3239: Thu Apr 23 08:04:16 2015
  read : io=393920MB, bw=223916KB/s, iops=27, runt=1801449msec
clat (msec): min=65, max=23189, avg=2340.68, stdev=1351.01
 lat (msec): min=65, max=23189, avg=2340.

Re: [ceph-users] rados cppool

2015-04-23 Thread Pavel V. Kaygorodov
Hi!

I have copied two of my pools recently, because the old ones had too many PGs.
Both of them contain RBD images, with 1GB and ~30GB of data.
Both pools were copied without errors; the RBD images are mountable and seem to be 
fine.
CEPH version is 0.94.1

Pavel.
 

> 7 апр. 2015 г., в 18:29, Kapil Sharma  написал(а):
> 
> Hi folks,
> 
> I will really appreciate if someone could try "rados cppool  
> "
> command on their Hammer ceph cluster. It throws an error for me, not sure if 
> this is
> an upstream issue or something related to our distro only.
> 
> error trace- http://pastebin.com/gVkbiPLa
> 
> This works fine for me in my firefly cluster.
> 
> -- 
> Regards,
> Kapil.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk  wrote:
> If you can let us know the avg queue depth that ZFS is generating that will
> probably give a good estimation of what you can expect from the cluster.

How would that be measured?

> I have had a look through the fio runs, could you also try and run a couple
> of jobs with iodepth=64 instead of numjobs=64. I know they should do the
> same thing, but the numbers with the former are easier to understand.

That should be no problem; results should be available in a couple of hours.

> It may also be worth getting hold of the RBD enabled fio so that you can
> test performance outside of the VM.

Unless it comes in an apt-get-able package, that might be more of an issue. :)

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool

2015-04-23 Thread Sage Weil
On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote:
> Hi!
> 
> I have copied two of my pools recently, because the old ones had too many PGs.
> Both of them contain RBD images, with 1GB and ~30GB of data.
> Both pools were copied without errors; the RBD images are mountable and seem 
> to be fine.
> CEPH version is 0.94.1

You will likely have problems if you try to delete snapshots that existed 
on the images (snaps are not copied/preserved by cppool).

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> J David
> Sent: 23 April 2015 20:19
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Having trouble getting good performance
> 
> On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk  wrote:
> > If you can let us know the avg queue depth that ZFS is generating that
> > will probably give a good estimation of what you can expect from the
> cluster.
> 
> How would that be measured?

Hopefully you should be able to see this by running iostat in the VM and
looking at the device which contains the ZFS volume.

> 
> > I have had a look through the fio runs, could you also try and run a
> > couple of jobs with iodepth=64 instead of numjobs=64. I know they
> > should do the same thing, but the numbers with the former are easier to
> understand.
> 
> That should be no problem; results should be available in a couple of
hours.
> 
> > It may also be worth getting hold of the RBD enabled fio so that you
> > can test performance outside of the VM.
> 
> Unless it comes in an apt-get-able package, that might be more of an
issue. :)

I think I managed to install the vivid version on Ubuntu Trusty:

http://packages.ubuntu.com/vivid/fio



> 
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serving multiple applications with a single cluster

2015-04-23 Thread Nick Fisk
Hi Rafael,

Do you require a shared FS for these applications or would a block device
with a traditional filesystem be suitable?

If it is, then you could create separate pools with a RBD block device in
each.

Just out of interest what is the reason for separation, security or
performance?

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Rafael Coninck Teigão
> Sent: 23 April 2015 19:39
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Serving multiple applications with a single cluster
> 
> Hello everyone.
> 
> I'm new to the list and also just a beginner at using Ceph, and I'd like
to get
> some advice from you on how to create the right infrastructure for our
> scenario.
> 
> We'd like to provide storage to three different applications, but each
should
> have its own "area". Also, ideally we'd like to avoid using RGW, so that
we
> can deploy the new storage without changing the applications too much.
> 
> Is it possible to accomplish this with a single cluster? I know I won't be
able to
> have multiple CephFS with decent isolation
> (https://wiki.ceph.com/Planning/Sideboard/Client_Security_for_CephFS)
> and that running multiple clusters on the same hardware involves changing
> all the TCP ports for each instance.
> 
> I guess the perfect solution for us would be able to create different
pools
> and serve them in different CephFS configurations, but that's not possible
as
> of now right?
> 
> How would you go in configuring Ceph for this scenario?
> 
> Thanks,
> Rafael.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Wiki

2015-04-23 Thread Patrick McGarry
Hey cephers,

Just wanted to let you all know that the OAuth portion of the wiki
login has been removed in favor of stand-alone auth for now. Our plan
longer-term is to replace the wiki with something that will scale with
us a bit better and be more open. If you never set a password you
should still be able to do a password recovery based on your original
email.

If you have any questions or concerns please let me know. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk  wrote:
> I have had a look through the fio runs, could you also try and run a couple
> of jobs with iodepth=64 instead of numjobs=64. I know they should do the
> same thing, but the numbers with the former are easier to understand.

Maybe it's an issue of interpretation, but this doesn't actually seem to work:

testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
ioengine=sync, iodepth=64

fio-2.1.3

Starting 1 process


testfile: (groupid=0, jobs=1): err= 0: pid=5967: Thu Apr 23 19:40:14 2015

  write: io=5762.7MB, bw=3278.3KB/s, iops=25, runt=1800048msec

clat (msec): min=5, max=807, avg=39.03, stdev=59.24

 lat (msec): min=5, max=807, avg=39.04, stdev=59.24

clat percentiles (msec):

 |  1.00th=[6],  5.00th=[7], 10.00th=[7], 20.00th=[8],

 | 30.00th=[   10], 40.00th=[   13], 50.00th=[   19], 60.00th=[   27],

 | 70.00th=[   37], 80.00th=[   50], 90.00th=[   92], 95.00th=[  149],

 | 99.00th=[  306], 99.50th=[  416], 99.90th=[  545], 99.95th=[  586],

 | 99.99th=[  725]

bw (KB  /s): min=  214, max=12142, per=100.00%, avg=3352.50, stdev=1595.13

lat (msec) : 10=32.27%, 20=19.42%, 50=28.25%, 100=10.94%, 250=7.67%

lat (msec) : 500=1.25%, 750=0.20%, 1000=0.01%

  cpu  : usr=0.06%, sys=0.18%, ctx=46686, majf=0, minf=29

  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

 issued: total=r=0/w=46101/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):

  WRITE: io=5762.7MB, aggrb=3278KB/s, minb=3278KB/s, maxb=3278KB/s,
mint=1800048msec, maxt=1800048msec


Disk stats (read/write):

  vdb: ios=0/46809, merge=0/355, ticks=0/1837916, in_queue=1837916, util=99.95%


It's like it didn't honor the setting.  (And whew, 25 iops & 3M/sec, ouch.)

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Mark Nelson



On 04/23/2015 03:22 PM, J David wrote:

On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk  wrote:

I have had a look through the fio runs, could you also try and run a couple
of jobs with iodepth=64 instead of numjobs=64. I know they should do the
same thing, but the numbers with the former are easier to understand.


Maybe it's an issue of interpretation, but this doesn't actually seem to work:

testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
ioengine=sync, iodepth=64


If you want to adjust the iodepth, you'll need to use an asynchronous 
ioengine like libaio (you also need to use direct=1)


Mark



Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread Nick Fisk




> -Original Message-
> From: jdavidli...@gmail.com [mailto:jdavidli...@gmail.com] On Behalf Of J
> David
> Sent: 23 April 2015 21:22
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Having trouble getting good performance
> 
> On Thu, Apr 23, 2015 at 3:05 PM, Nick Fisk  wrote:
> > I have had a look through the fio runs, could you also try and run a
> > couple of jobs with iodepth=64 instead of numjobs=64. I know they
> > should do the same thing, but the numbers with the former are easier to
> understand.
> 
> Maybe it's an issue of interpretation, but this doesn't actually seem to work:
> 
> testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
> ioengine=sync, iodepth=64

Try setting ioengine=libaio.

I'm not sure if the sync engine supports IO depths >1; also, I think you might 
need direct=1.
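
For example, as a one-liner (only a sketch -- adjust size and runtime to taste):

  fio --name=testfile --filename=testfile --size=30G --rw=randwrite --bs=128k \
      --ioengine=libaio --iodepth=64 --direct=1 --runtime=1800 --time_based \
      --group_reporting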

> 
> fio-2.1.3
> 
> Starting 1 process
> 
> 
> testfile: (groupid=0, jobs=1): err= 0: pid=5967: Thu Apr 23 19:40:14 2015
> 
>   write: io=5762.7MB, bw=3278.3KB/s, iops=25, runt=1800048msec
> 
> clat (msec): min=5, max=807, avg=39.03, stdev=59.24
> 
>  lat (msec): min=5, max=807, avg=39.04, stdev=59.24
> 
> clat percentiles (msec):
> 
>  |  1.00th=[6],  5.00th=[7], 10.00th=[7], 20.00th=[8],
> 
>  | 30.00th=[   10], 40.00th=[   13], 50.00th=[   19], 60.00th=[   27],
> 
>  | 70.00th=[   37], 80.00th=[   50], 90.00th=[   92], 95.00th=[  149],
> 
>  | 99.00th=[  306], 99.50th=[  416], 99.90th=[  545], 99.95th=[  586],
> 
>  | 99.99th=[  725]
> 
> bw (KB  /s): min=  214, max=12142, per=100.00%, avg=3352.50,
> stdev=1595.13
> 
> lat (msec) : 10=32.27%, 20=19.42%, 50=28.25%, 100=10.94%, 250=7.67%
> 
> lat (msec) : 500=1.25%, 750=0.20%, 1000=0.01%
> 
>   cpu  : usr=0.06%, sys=0.18%, ctx=46686, majf=0, minf=29
> 
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
> 
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> 
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> 
>  issued: total=r=0/w=46101/d=0, short=r=0/w=0/d=0
> 
> 
> Run status group 0 (all jobs):
> 
>   WRITE: io=5762.7MB, aggrb=3278KB/s, minb=3278KB/s, maxb=3278KB/s,
> mint=1800048msec, maxt=1800048msec
> 
> 
> Disk stats (read/write):
> 
>   vdb: ios=0/46809, merge=0/355, ticks=0/1837916, in_queue=1837916,
> util=99.95%
> 
> 
> It's like it didn't honor the setting.  (And whew, 25 iops & 3M/sec, ouch.)
> 
> Thanks!




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Coding : gf-Complete

2015-04-23 Thread Garg, Pankaj
Hi,

I would like to use the gf-complete library for Erasure coding since it has 
some ARM v8 based optimizations. I see that the code is part of my tree, but 
not sure if these libraries are included in the final build.
I only see the libec_jerasure*.so in my libs folder after installation.
Are the gf-complete based optimizations part of this already? Or do I build 
them separately and then install them.
I am using the latest Firefly release (0.80.9).

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serving multiple applications with a single cluster

2015-04-23 Thread Rafael Coninck Teigão

Hi Nick,

Thanks for answering.

Each application runs on its own cluster (these are Glassfish clusters, 
and are distributed as nodes appA01, appA02,..., appB01, appB02, etc) 
and each node on the cluster has to have access to the same files. 
Currently we are using NFS for this, but it has its limitations (max 
size, HA).


I guess if I could just mount the same pool on each cluster node, it 
would work (say poolA on appA01 and appA02, poolB on appB01 and appB02), 
but this is not possible with RBD, right?


The main reason for separating the areas is security, so that the 
superuser of one application cluster can't access the files of the other 
two.


Thanks,
Rafael.

On 23/04/2015 17:05, Nick Fisk wrote:

Hi Rafael,

Do you require a shared FS for these applications or would a block device
with a traditional filesystem be suitable?

If it is, then you could create separate pools with a RBD block device in
each.

Just out of interest what is the reason for separation, security or
performance?

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Rafael Coninck Teigão
Sent: 23 April 2015 19:39
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Serving multiple applications with a single cluster

Hello everyone.

I'm new to the list and also just a beginner at using Ceph, and I'd like

to get

some advice from you on how to create the right infrastructure for our
scenario.

We'd like to provide storage to three different applications, but each

should

have its own "area". Also, ideally we'd like to avoid using RGW, so that

we

can deploy the new storage without changing the applications too much.

Is it possible to accomplish this with a single cluster? I know I won't be

able to

have multiple CephFS with decent isolation
(https://wiki.ceph.com/Planning/Sideboard/Client_Security_for_CephFS)
and that running multiple clusters on the same hardware involves changing
all the TCP ports for each instance.

I guess the perfect solution for us would be able to create different

pools

and serve them in different CephFS configurations, but that's not possible

as

of now right?

How would you go in configuring Ceph for this scenario?

Thanks,
Rafael.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serving multiple applications with a single cluster

2015-04-23 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Rafael Coninck Teigão
> Sent: 23 April 2015 22:35
> To: Nick Fisk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Serving multiple applications with a single
cluster
> 
> Hi Nick,
> 
> Thanks for answering.
> 
> Each application runs on its own cluster (these are Glassfish clusters,
and are
> distributed as nodes appA01, appA02,..., appB01, appB02, etc) and each
node
> on the cluster has to have access to the same files.
> Currently we are using NFS for this, but it has its limitations (max size,
HA).

Yes this would be the main advantage of using CephFS, but as you have stated
it might not give you the security functionality.

> 
> I guess if I could just mount the same pool on each cluster node, it would
> work (say poolA on appA01 and appA02, poolB on appB01 and appB02), but
> this is not possible with RBD, right?

RBD is not the problem, you can map it as many times as you want, but the
filesystem on top needs to support concurrent access. The only possible option
would be Pacemaker+NFS, but I see above that you consider this as not meeting
your requirements.
I'm not sure what else to suggest.
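
One thing that would at least cover the security side (though not the shared-FS
requirement) is giving each application its own cephx key restricted to its own
pool, so a superuser on one application cluster can't reach the others' data. A
rough sketch (pool/client names and the PG count are placeholders; double-check
the cap syntax against your release):

  ceph osd pool create poolA 256
  ceph auth get-or-create client.appA mon 'allow r' osd 'allow rwx pool=poolA' \
      -o /etc/ceph/ceph.client.appA.keyring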

> 
> The main reason for separating the areas is security, so that the
superuser of
> one application cluster can't access the files of the other two.
> 
> Thanks,
> Rafael.
> 
> On 23/04/2015 17:05, Nick Fisk wrote:
> > Hi Rafael,
> >
> > Do you require a shared FS for these applications or would a block
> > device with a traditional filesystem be suitable?
> >
> > If it is, then you could create separate pools with a RBD block device
> > in each.
> >
> > Just out of interest what is the reason for separation, security or
> > performance?
> >
> > Nick
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Rafael Coninck Teigão
> >> Sent: 23 April 2015 19:39
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] Serving multiple applications with a single
> >> cluster
> >>
> >> Hello everyone.
> >>
> >> I'm new to the list and also just a beginner at using Ceph, and I'd
> >> like
> > to get
> >> some advice from you on how to create the right infrastructure for
> >> our scenario.
> >>
> >> We'd like to provide storage to three different applications, but
> >> each
> > should
> >> have its own "area". Also, ideally we'd like to avoid using RGW, so
> >> that
> > we
> >> can deploy the new storage without changing the applications too much.
> >>
> >> Is it possible to accomplish this with a single cluster? I know I
> >> won't be
> > able to
> >> have multiple CephFS with decent isolation
> >> (https://wiki.ceph.com/Planning/Sideboard/Client_Security_for_CephFS)
> >> and that running multiple clusters on the same hardware involves
> >> changing all the TCP ports for each instance.
> >>
> >> I guess the perfect solution for us would be able to create different
> > pools
> >> and serve them in different CephFS configurations, but that's not
> >> possible
> > as
> >> of now right?
> >>
> >> How would you go in configuring Ceph for this scenario?
> >>
> >> Thanks,
> >> Rafael.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding : gf-Complete

2015-04-23 Thread Loic Dachary
Hi,

The ARMv8 optimizations for gf-complete are in Hammer, not in Firefly. The 
libec_jerasure*.so plugin contains gf-complete.
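
For reference, the plugin is selected through the erasure-code profile; something
like the following picks jerasure explicitly (the k/m/technique values are only
examples, not a recommendation):

  ceph osd erasure-code-profile set ec42 plugin=jerasure k=4 m=2 technique=reed_sol_van
  ceph osd pool create ecpool 256 256 erasure ec42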

Cheers

On 23/04/2015 23:29, Garg, Pankaj wrote:
> Hi,
> 
>  
> 
> I would like to use the gf-complete library for Erasure coding since it has 
> some ARM v8 based optimizations. I see that the code is part of my tree, but 
> not sure if these libraries are included in the final build.
> 
> I only see the libec_jerasure*.so in my libs folder after installation.
> 
> Are the gf-complete based optimizations part of this already? Or do I build 
> them separately and then install them.
> 
> I am using the latest Firefly release (0.80.9).
> 
>  
> 
> Thanks
> 
> Pankaj
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding : gf-Complete

2015-04-23 Thread Garg, Pankaj
Thanks Loic. I was just looking at the source trees for gf-complete and saw 
that the v2-ceph tag has the optimizations and that it's associated with Hammer.

One more question: on Hammer, will the optimizations kick in automatically for 
ARM? Do all of the different techniques have ARM optimizations, or do I have to 
select a particular one to take advantage of them?

-Pankaj

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org] 
Sent: Thursday, April 23, 2015 2:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Erasure Coding : gf-Complete

Hi,

The ARMv8 optimizations for gf-complete are in Hammer, not in Firefly. The 
libec_jerasure*.so plugin contains gf-complete.

Cheers

On 23/04/2015 23:29, Garg, Pankaj wrote:
> Hi,
> 
>  
> 
> I would like to use the gf-complete library for Erasure coding since it has 
> some ARM v8 based optimizations. I see that the code is part of my tree, but 
> not sure if these libraries are included in the final build.
> 
> I only see the libec_jerasure*.so in my libs folder after installation.
> 
> Are the gf-complete based optimizations part of this already? Or do I build 
> them separately and then install them.
> 
> I am using the latest Firefly release (0.80.9).
> 
>  
> 
> Thanks
> 
> Pankaj
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-04-23 Thread Anthony Levesque
To update you on the current test in our lab:

1. We tested the Samsung OSDs in recovery mode and the speed was able to max out 
2x 10GbE ports (transferring data at 2200+ MB/s during recovery). So for normal 
write operations without O_DSYNC writes the Samsung drives seem OK.

2. We then tested a couple of different models of SSD we had in stock with the 
following command:

dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync

This was from a blog post written by Sébastien Han and I think it should show how 
the drives perform with O_DSYNC writes. For people interested, here are some 
results of what we tested:

Intel DC S3500 120GB =  114 MB/s
Samsung Pro 128GB = 2.4 MB/s
WD Black 1TB (HDD) =409 KB/s
Intel 330 120GB =   105 MB/s
Intel 520 120GB =   9.4 MB/s
Intel 335 80GB =9.4 MB/s
Samsung EVO 1TB =   2.5 MB/s
Intel 320 120GB =   78 MB/s
OCZ Revo Drive 240GB =  60.8 MB/s
4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s

Please let us know if the command we ran was not optimal for testing O_DSYNC writes.

We ordered larger drives from the Intel DC series to see if we could get more than 
200 MB/s per SSD. We will keep you posted on those tests if that interests you guys. 
We didn't run multiple parallel tests yet (to simulate multiple journals on one SSD).
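
When we do, the plan is roughly the following fio run (only a sketch -- the device
name and job count are placeholders, and it will overwrite whatever is on the
target device):

  fio --name=journal-sim --filename=/dev/sdX --rw=write --bs=4k \
      --ioengine=libaio --iodepth=1 --sync=1 --direct=1 \
      --numjobs=4 --runtime=60 --group_reporting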

3. We removed the journals from all Samsung OSDs and put 2x Intel 330 120GB in all 
6 nodes to test.  The overall speed we were getting from the rados bench went 
from 1000 MB/s (approx.) to 450 MB/s, which might only be because the Intels cannot 
do much in terms of journaling (they tested at around 100 MB/s).  It 
will be interesting to test with bigger Intel DC S3500 drives (and more 
journals) per node to see if I can get back up to 1000 MB/s or even surpass it.

We also wanted to test whether the CPU could be a huge bottleneck, so we swapped out 
the dual E5-2620v2 in node #6 and replaced them with dual E5-2609v2 (which are much 
smaller in core count and speed), and the 450 MB/s we got from the rados bench went 
even lower, to 180 MB/s.

So I'm wondering whether the 1000 MB/s we got when the journal was shared on the OSD 
SSDs was limited by the CPUs (even though the Samsungs are not good for 
journals in the long run) rather than just by the fact that Samsung SSDs are bad at 
O_DSYNC writes (or maybe both).  It is probable that 16 SSD OSDs per node in a 
full-SSD cluster is too much and the major bottleneck will be the CPU.

4. I'm wondering, if we find good SSDs for the journals and keep the Samsungs for 
normal writes and reads (we can saturate 20GbE easily with a read benchmark; we will 
test 40GbE soon), whether the cluster will stay healthy, since the Samsungs seem to 
get burnt by O_DSYNC writes.

5. In terms of HBA controllers, have you guys made any tests for a full-SSD 
cluster or even just for SSD journals?

Anthony Lévesque

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having trouble getting good performance

2015-04-23 Thread J David
On Thu, Apr 23, 2015 at 4:23 PM, Mark Nelson  wrote:
> If you want to adjust the iodepth, you'll need to use an asynchronous
> ioengine like libaio (you also need to use direct=1)

Ah yes, libaio makes a big difference.  With 1 job:

testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
ioengine=libaio, iodepth=64
fio-2.1.3
Starting 1 process

testfile: (groupid=0, jobs=1): err= 0: pid=6290: Thu Apr 23 20:43:27 2015
  write: io=30720MB, bw=28503KB/s, iops=222, runt=1103633msec
slat (usec): min=12, max=1049.4K, avg=2427.89, stdev=13913.04
clat (msec): min=4, max=1975, avg=284.97, stdev=268.71
 lat (msec): min=4, max=1975, avg=287.40, stdev=268.37
clat percentiles (msec):
 |  1.00th=[7],  5.00th=[   11], 10.00th=[   20], 20.00th=[   36],
 | 30.00th=[   60], 40.00th=[  120], 50.00th=[  219], 60.00th=[  318],
 | 70.00th=[  416], 80.00th=[  519], 90.00th=[  652], 95.00th=[  766],
 | 99.00th=[ 1090], 99.50th=[ 1221], 99.90th=[ 1516], 99.95th=[ 1598],
 | 99.99th=[ 1860]
bw (KB  /s): min=  236, max=170082, per=100.00%, avg=29037.74,
stdev=15788.85
lat (msec) : 10=4.63%, 20=5.77%, 50=16.59%, 100=10.64%, 250=15.40%
lat (msec) : 500=25.38%, 750=15.89%, 1000=4.00%, 2000=1.70%
  cpu  : usr=0.37%, sys=1.00%, ctx=99920, majf=0, minf=27
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=0/w=245760/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=30720MB, aggrb=28503KB/s, minb=28503KB/s, maxb=28503KB/s,
mint=1103633msec, maxt=1103633msec

Disk stats (read/write):
  vdb: ios=0/246189, merge=0/219, ticks=0/67559576, in_queue=67564864,
util=100.00%

With 2 jobs:

testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
ioengine=libaio, iodepth=64
testfile: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K,
ioengine=libaio, iodepth=64
fio-2.1.3
Starting 2 processes

testfile: (groupid=0, jobs=2): err= 0: pid=6394: Thu Apr 23 21:24:09 2015
  write: io=46406MB, bw=26384KB/s, iops=206, runt=1801073msec
slat (usec): min=11, max=3457.7K, avg=9589.56, stdev=44841.01
clat (msec): min=5, max=5256, avg=611.29, stdev=507.51
 lat (msec): min=5, max=5256, avg=620.88, stdev=510.21
clat percentiles (msec):
 |  1.00th=[   25],  5.00th=[   62], 10.00th=[  102], 20.00th=[  192],
 | 30.00th=[  293], 40.00th=[  396], 50.00th=[  502], 60.00th=[  611],
 | 70.00th=[  742], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1582],
 | 99.00th=[ 2376], 99.50th=[ 2769], 99.90th=[ 3687], 99.95th=[ 4080],
 | 99.99th=[ 4686]
bw (KB  /s): min=   98, max=108111, per=53.88%, avg=14214.41, stdev=10031.64
lat (msec) : 10=0.24%, 20=0.46%, 50=2.85%, 100=6.27%, 250=16.04%
lat (msec) : 500=24.00%, 750=20.47%, 1000=12.35%, 2000=15.14%, >=2000=2.17%
  cpu  : usr=0.18%, sys=0.49%, ctx=291909, majf=0, minf=55
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=0/w=371246/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=46406MB, aggrb=26383KB/s, minb=26383KB/s, maxb=26383KB/s,
mint=1801073msec, maxt=1801073msec

Disk stats (read/write):
  vdb: ios=0/371958, merge=0/358, ticks=0/111668288,
in_queue=111672480, util=100.00%
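
For reference, a command line roughly equivalent to the libaio runs above
(the file path, size and job count are assumptions inferred from the output,
not the exact invocation) would be:

    fio --name=testfile --filename=/mnt/test/fio-testfile --size=30g \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=128k \
        --iodepth=64 --numjobs=2 --group_reporting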

And here is some "iostat -xt 10" from the start of the ZFS machine
doing a snapshot receive:  (vdb = the Ceph RBD)

04/24/2015 12:12:50 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.100.000.300.000.00   99.60

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda   0.00 0.000.000.10 0.00 0.40
8.00 0.000.000.000.00   0.00   0.00
vdb   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
vdc   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00

04/24/2015 12:13:00 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.600.001.209.270.00   88.93

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda   0.00 0.000.201.70 2.40 6.80
9.68 0.013.37   20.001.41   3.37   0.64
vdb   0.00 0.000.20   13.50 0.50   187.10
27.39 0.26   18.86  112.00   17.48  13.55  18.56
vdc   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00

04/24/2015 12:13:10 AM
avg-cpu:  %user   %nice %s

Re: [ceph-users] Accidentally Remove OSDs

2015-04-23 Thread FaHui Lin

Hi, thank you for your response.

Well, I've not only taken out but also totally removed both OSDs of that pg
(by "ceph osd rm" and deleting everything in /var/lib/ceph/osd/), and
similarly for all the other stale pgs.


The main problem I have is that those stale pgs (missing all the OSDs I've
removed) not only cause a ceph health warning, but other machines cannot
mount the ceph rbd either.


Here's the full crush map.  The OSDs I removed were osd.5~19.

   # begin crush map
   tunable choose_local_tries 0
   tunable choose_local_fallback_tries 0
   tunable choose_total_tries 500

   # devices
   device 0 osd.0
   device 1 device1
   device 2 osd.2
   device 3 osd.3
   device 4 osd.4
   device 5 device5
   device 6 device6
   device 7 device7
   device 8 device8
   device 9 device9
   device 10 device10
   device 11 device11
   device 12 device12
   device 13 device13
   device 14 device14
   device 15 device15
   device 16 device16
   device 17 device17
   device 18 device18
   device 19 device19
   device 20 osd.20
   device 21 osd.21
   device 22 osd.22
   device 23 osd.23
   device 24 osd.24
   device 25 osd.25
   device 26 osd.26
   device 27 osd.27

   # types
   type 0 osd
   type 1 host
   type 2 rack
   type 3 row
   type 4 room
   type 5 datacenter
   type 6 root

   # buckets
   host XX-ceph01 {
           id -2           # do not change unnecessarily
           # weight 160.040
           alg straw
           hash 0          # rjenkins1
           item osd.0 weight 40.010
           item osd.2 weight 40.010
           item osd.3 weight 40.010
           item osd.4 weight 40.010
   }
   host XX-ceph02 {
           id -3           # do not change unnecessarily
           # weight 320.160
           alg straw
           hash 0          # rjenkins1
           item osd.20 weight 40.020
           item osd.21 weight 40.020
           item osd.22 weight 40.020
           item osd.23 weight 40.020
           item osd.24 weight 40.020
           item osd.25 weight 40.020
           item osd.26 weight 40.020
           item osd.27 weight 40.020
   }
   root default {
           id -1           # do not change unnecessarily
           # weight 480.200
           alg straw
           hash 0          # rjenkins1
           item XX-ceph01 weight 160.040
           item XX-ceph02 weight 320.160
   }

   # rules
   rule data {
           ruleset 0
           type replicated
           min_size 1
           max_size 10
           step take default
           step chooseleaf firstn 0 type host
           step emit
   }
   rule metadata {
           ruleset 1
           type replicated
           min_size 1
           max_size 10
           step take default
           step chooseleaf firstn 0 type host
           step emit
   }
   rule rbd {
           ruleset 2
           type replicated
           min_size 1
           max_size 10
           step take default
           step chooseleaf firstn 0 type host
           step emit
   }

   # end crush map

List of some stale pgs:

   pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
   17.c6 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:09.358613 0'0 2706:216 [19,13] 19 [19,13] 19 0'0 2015-04-16 02:29:34.882038 0'0 2015-04-16 02:29:34.882038
   17.c7 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:28.304621 0'0 2718:262 [15,18] 15 [15,18] 15 0'0 2015-04-20 09:15:39.363310 0'0 2015-04-20 09:15:39.363310
   17.c1 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:01.073681 0'0 2706:199 [19,16] 19 [19,16] 19 0'0 2015-04-15 12:37:11.741251 0'0 2015-04-15 12:37:11.741251
   17.de 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:29.436796 0'0 2718:267 [15] 15 [15] 15 0'0 2015-04-13 07:56:01.760824 0'0 2015-04-13 07:56:01.760824
   17.da 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:50.001087 0'0 2718:232 [14] 14 [14] 14 0'0 2015-04-19 15:45:53.304596 0'0 2015-04-19 15:45:53.304596
   17.d9 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:29.472983 0'0 2718:270 [14] 14 [14] 14 0'0 2015-04-16 01:55:44.183550 0'0 2015-04-16 01:55:44.183550
   17.d7 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:53.

Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-04-23 Thread Christian Balzer

Hello,

On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote:

> To update you on the current test in our lab:
> 
> 1.We tested the Samsung OSD in Recovery mode and the speed was able to
> maxout 2x 10GbE port(transferring data at 2200+ MB/s during recovery).
> So for normal write operation without O_DSYNC writes Samsung drives seem
> ok.
> 
> 2.We then tested a couple of different model of SSD we had in stock with
> the following command:
> 
> dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync
> 
> This was from a blog written by Sebastien Han and I think should be able
> to show how the drives would perform in O_DSYNC writes. For people
> interested in some result of what we tested here they are:
> 
> Intel DC S3500 120GB =114 MB/s
> Samsung Pro 128GB =   2.4 MB/s
> WD Black 1TB (HDD) =  409 KB/s
> Intel 330 120GB = 105 MB/s
> Intel 520 120GB = 9.4 MB/s
> Intel 335 80GB =  9.4 MB/s
> Samsung EVO 1TB = 2.5 MB/s
> Intel 320 120GB = 78 MB/s
> OCZ Revo Drive 240GB =60.8 MB/s
> 4x Samsung EVO 1TB LSI RAID0 HW + BBU =   28.4 MB/s
>
No real surprises here, but a nice summary nonetheless. 

You _really_ want to avoid consumer SSDs for journals and have a good idea
on how much data you'll write per day and how long you expect your SSDs to
last (the TBW/$ ratio).

> Please let us know if the command we ran was not optimal to test O_DSYNC
> writes
> 
> We order larger drive from Intel DC series to see if we could get more
> than 200 MB/s per SSD. We will keep you posted on tests if that
> interested you guys. We dint test multiple parallel test yet (to
> simulate multiple journal on one SSD).
> 
You can totally trust the numbers on Intel's site:
http://ark.intel.com/products/family/83425/Data-Center-SSDs

The S3500s are by far the slowest and have the lowest endurance.
Again, depending on your expected write level, the S3610 or S3700 models are
going to be a better fit regarding price/performance, especially when you
consider that losing a journal SSD will result in several dead OSDs.
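
(For context, the reason a dead journal SSD takes several OSDs with it is
that one journal device is usually partitioned and shared between multiple
OSDs at provisioning time; a rough sketch with ceph-disk, where the device
names are purely examples, looks like this:)

    # /dev/sdf is one shared journal SSD; each call adds a journal partition on it
    ceph-disk prepare /dev/sdb /dev/sdf
    ceph-disk prepare /dev/sdc /dev/sdf
    ceph-disk prepare /dev/sdd /dev/sdf

If /dev/sdf dies, all three OSDs above lose their journals at once.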

> 3.We remove the Journal from all Samsung OSD and put 2x Intel 330 120GB
> on all 6 Node to test.  The overall speed we were getting from the rados
> bench went from 1000 MB/s(approx.) to 450 MB/s which might only be
> because the intel cannot do too much in term of journaling (was tested
> at around 100 MB/s).  It will be interesting to test with bigger Intel
> DC S3500 drives(and more journals) per node to see if I can back up to
> 1000MB/s or even surpass it.
> 
> We also wanted to test if the CPU could be a huge bottle neck so we swap
> the Dual E5-2620v2 from node #6 and replace them with Dual
> E5-2609v2(Which are much smaller in core and speed) and the 450 MB/s we
> got from he rados bench went even lower to 180 MB/s.
> 
You really don't have to swap CPUs around, monitor things with atop or
other tools to see where your bottlenecks are.

> So Im wondering if the 1000MB/s we got when the Journal was shared on
> the OSD SSD was not limited by the CPUs (even though the samsung are not
> good for journals on the long run) and not just by the fact Samsung SSD
> are bad in O_DSYNC writes(or maybe both).  It is probable that 16 SSD
> OSD per node in a full SSD cluster is too much and the major bottleneck
> will be from the CPU.
> 
That's what I kept saying. ^.^

> 4.Im wondering if we find good SSD for the journal and keep the samsung
> for normal writes and read(We can saturate 20GbE easy with read
> benchmark. We will test 40GbE soon) if the cluster will keep healthy
> since Samsung seem to get burnt from O_DSYNC writes.
> 
They will get burned, as in have their cells worn out by any and all
writes.

> 5.In term of HBA controller, did you guys have made any test for a full
> SSD cluster or even just for SSD Journal.
> 
If you have separate journals and OSDs, it often makes good sense to have
them on separate controllers as well. 
It all depends on density of your setup and capabilities of the
controllers.
LSI HBAs in IT mode are a known and working entity.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accidentally Remove OSDs

2015-04-23 Thread Robert LeBlanc
What hosts were those OSDs on? I'm concerned that the two OSDs for some of
the PGs were adjacent, and if that placed them on the same host it would be
contrary to your rules and something deeper would be wrong.

Did you format the disks that were taken out of the cluster? Can you mount
the partitions and see the files and directories? If so, you can probably
recover the data using the recovery/dev tools.

You may be able to force create the missing PGs using ceph pg
force_create_pg <pg.id>. This may or may not work, I don't remember.
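
A minimal sketch of that approach, assuming the stale pg ids from the
listing above (verify the list first with ceph pg dump_stuck stale):

    ceph pg dump_stuck stale          # list the stale pgs
    ceph pg force_create_pg 17.c6     # repeat for each stale pg id
    ceph pg force_create_pg 17.c7

or, looping over whatever dump_stuck reports:

    for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
        ceph pg force_create_pg "$pg"
    done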

If you just don't care about losing data, you can delete the pool and
create a new one. This should work for sure, but loses any data that you
might have still had. If this pool was full of RBD, then there is a high
possibility that all of your RBD images had chunks in the missing PGs. If
you choose not to try to restore the PGs using the tools, I'd be inclined
to delete the pool and restore from backup, so as not to be surprised by
data corruption in the images. Neither option is ideal or quick.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Apr 23, 2015 6:42 PM, "FaHui Lin"  wrote:

>  Hi, thank you for your response.
>
> Well, I've not only taken out but also totally removed the both OSDs (by
> "ceph osd rm" and delete everything in /var/lib/ceph/osd/) of
> that pg (and similar to all other stale pgs.)
>
> The main problem I have is those stale pgs (miss all OSDs I've removed)
> not merely make ceph health warning, but other machine cannot mount the
> ceph rbd as well.
>
> Here's the full crush map.  The OSDs I removed were osd.5~19.
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 500
>
> # devices
> device 0 osd.0
> device 1 device1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 device5
> device 6 device6
> device 7 device7
> device 8 device8
> device 9 device9
> device 10 device10
> device 11 device11
> device 12 device12
> device 13 device13
> device 14 device14
> device 15 device15
> device 16 device16
> device 17 device17
> device 18 device18
> device 19 device19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host XX-ceph01 {
> id -2   # do not change unnecessarily
> # weight 160.040
> alg straw
> hash 0  # rjenkins1
> item osd.0 weight 40.010
> item osd.2 weight 40.010
> item osd.3 weight 40.010
> item osd.4 weight 40.010
> }
> host XX-ceph02 {
> id -3   # do not change unnecessarily
> # weight 320.160
> alg straw
> hash 0  # rjenkins1
> item osd.20 weight 40.020
> item osd.21 weight 40.020
> item osd.22 weight 40.020
> item osd.23 weight 40.020
> item osd.24 weight 40.020
> item osd.25 weight 40.020
> item osd.26 weight 40.020
> item osd.27 weight 40.020
> }
> root default {
> id -1   # do not change unnecessarily
> # weight 480.200
> alg straw
> hash 0  # rjenkins1
> item XX-ceph01 weight 160.040
> item XX-ceph02 weight 320.160
> }
>
> # rules
> rule data {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
> rule metadata {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
> rule rbd {
> ruleset 2
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
>
> List of some stale pgs:
>
> pg_stat objects mip degrmispunf bytes   log disklog
> state   state_stamp v   reportedup  up_primary
> acting  acting_primary  last_scrub  scrub_stamp last_deep_scrub
> deep_scrub_stamp
> 17.c6   0   0   0   0   0   0   0   0
> stale+active+clean  2015-04-20 09:16:09.358613  0'0
> 2706:216[19,13] 19  [19,13] 19  0'0 2015-04-16
> 02:29:34.882038
>   0'0 2015-04-16 02:29:34.882038
> 17.c7   0   0   0   0   0   0   0   0
> stale+active+clean  2015-04-20 09:16:28.304621  0'0
> 2718:262[15,18] 15  [15,18] 15  0'0 2015-04-20
> 09:15:39.363310
>   0'0 2015-04-20 09:15:39.363310
> 17.c1   0   0   0   0   0   0   0   0
> stale+active+clean  2015-04-20 09:16:01.073681  0'0
> 2706:199[19,16] 19  [19,16] 19 

Re: [ceph-users] Serving multiple applications with a single cluster

2015-04-23 Thread Robert LeBlanc
You could map the RBD to each host and put a cluster file system like OCFS2
on it so all cluster nodes can read and write at the same time. If these
are VMs, then you can present the RBD via libvirt and the root user would
not have access to mount other RBDs in the same pool.
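
A minimal sketch of the per-pool isolation idea (pool, client and image
names below are made up): give each application its own pool plus a cephx
key that is only allowed to touch that pool, and create/map its RBD image
with that restricted identity.

    # one pool per application
    ceph osd pool create poolA 256

    # a key that can only write to poolA
    ceph auth get-or-create client.appA mon 'allow r' osd 'allow rwx pool=poolA' \
        -o /etc/ceph/ceph.client.appA.keyring

    # create and map an image as that restricted client
    rbd create poolA/appA-data --size 102400 --id appA
    rbd map poolA/appA-data --id appA

If the mapped image then has to be mounted on all nodes of one application
cluster at the same time, that is where a cluster filesystem such as OCFS2
on top of the device comes in.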

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Apr 23, 2015 3:41 PM, "Nick Fisk"  wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Rafael Coninck Teigão
> > Sent: 23 April 2015 22:35
> > To: Nick Fisk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Serving multiple applications with a single
> cluster
> >
> > Hi Nick,
> >
> > Thanks for answering.
> >
> > Each application runs on its own cluster (these are Glassfish clusters,
> and are
> > distributed as nodes appA01, appA02,..., appB01, appB02, etc) and each
> node
> > on the cluster has to have access to the same files.
> > Currently we are using NFS for this, but it has its limitations (max
> size,
> HA).
>
> Yes this would be the main advantage of using CephFS, but as you have
> stated
> it might not give you the security functionality.
>
> >
> > I guess if I could just mount the same pool on each cluster node, it
> would
> > work (say poolA on appA01 and appA02, poolB on appB01 and appB02), but
> > this is not possible with RBD, right?
>
> RBD is not the problem, you can map it as many times as you want, but the
> filesystem needs to support it. The only possible option would be
> Pacemaker+NFS but I see above you see this as not meeting your
> requirements.
> I'm not sure what else to suggest.
>
> >
> > The main reason for separating the areas is security, so that the
> superuser of
> > one application cluster can't access the files of the other two.
> >
> > Thanks,
> > Rafael.
> >
> > On 23/04/2015 17:05, Nick Fisk wrote:
> > > Hi Rafael,
> > >
> > > Do you require a shared FS for these applications or would a block
> > > device with a traditional filesystem be suitable?
> > >
> > > If it is, then you could create separate pools with a RBD block device
> > > in each.
> > >
> > > Just out of interest what is the reason for separation, security or
> > > performance?
> > >
> > > Nick
> > >
> > >> -Original Message-
> > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > >> Of Rafael Coninck Teigão
> > >> Sent: 23 April 2015 19:39
> > >> To: ceph-users@lists.ceph.com
> > >> Subject: [ceph-users] Serving multiple applications with a single
> > >> cluster
> > >>
> > >> Hello everyone.
> > >>
> > >> I'm new to the list and also just a beginner at using Ceph, and I'd
> > >> like
> > > to get
> > >> some advice from you on how to create the right infrastructure for
> > >> our scenario.
> > >>
> > >> We'd like to provide storage to three different applications, but
> > >> each
> > > should
> > >> have its own "area". Also, ideally we'd like to avoid using RGW, so
> > >> that
> > > we
> > >> can deploy the new storage without changing the applications too much.
> > >>
> > >> Is it possible to accomplish this with a single cluster? I know I
> > >> won't be
> > > able to
> > >> have multiple CephFS with decent isolation
> > >> (https://wiki.ceph.com/Planning/Sideboard/Client_Security_for_CephFS)
> > >> and that running multiple clusters on the same hardware involves
> > >> changing all the TCP ports for each instance.
> > >>
> > >> I guess the perfect solution for us would be able to create different
> > > pools
> > >> and serve them in different CephFS configurations, but that's not
> > >> possible
> > > as
> > >> of now right?
> > >>
> > >> How would you go in configuring Ceph for this scenario?
> > >>
> > >> Thanks,
> > >> Rafael.
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Shadow Files

2015-04-23 Thread Ben
We are still experiencing a problem with our gateway not properly
clearing out shadow files.


I have done numerous tests where I have:
-Uploaded a file of 1.5GB in size using the s3browser application
-Done an object stat on the file to get its prefix
-Done rados ls -p .rgw.buckets | grep <prefix> to count the number of
shadow files associated (in this case it is around 290 shadow files)

-Deleted said file with s3browser
-Performed a gc list, which shows the ~290 files listed
-Waited 24 hours and redone the rados ls -p .rgw.buckets | grep <prefix>
to recount the shadow files, only to be left with 290 files still there
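
For anyone wanting to reproduce this, the checks above boil down to roughly
the following (bucket, object and prefix values are examples; radosgw-admin
gc process can also be used to trigger a collection pass by hand instead of
waiting):

    radosgw-admin object stat --bucket=testbucket --object=bigfile.bin  # note the "prefix" field
    rados ls -p .rgw.buckets | grep <prefix> | wc -l    # count shadow objects
    radosgw-admin gc list                               # objects pending garbage collection
    radosgw-admin gc process                            # run a gc pass now
    rados ls -p .rgw.buckets | grep <prefix> | wc -l    # recount afterwards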


From log output /var/log/ceph/radosgw.log, I can see the following when 
clicking DELETE (this appears 290 times)
2015-04-24 10:43:29.996523 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996557 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996564 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996570 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996576 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996581 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996586 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule->part_size=0
2015-04-24 10:43:29.996592 7f0b0afb5700  0 RGWObjManifest::operator++(): 
result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule->part_size=0


In this same log, I also see the gc process saying it is removing said 
file (these records appear 290 times too)
2015-04-23 14:16:27.926952 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.928572 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.929636 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.930448 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.931226 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.932103 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:
2015-04-23 14:16:27.933470 7f15be0ee700  0 gc::process: removing 
.rgw.buckets:


So even though it appears that the GC is processing its removal, the 
shadow files remain!


Please help!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accidentally Remove OSDs

2015-04-23 Thread FaHui Lin

Dear Robert,

Yes, you're right. The two OSDs removed for those PGs were on the same
host, which contradicts my rules (that's one reason I removed them).
Unfortunately the partitions of the disks have all been formatted, so I
cannot recover the data.


However, running "ceph pg force_create_pg" on each stale pg and restarting
the OSD daemons worked to clean them up. Now my ceph health is OK and the
rbd service works normally.


Many thanks for your help,
FaHui


Robert LeBlanc wrote on 2015/4/24 10:08 AM:


What hosts were those OSDS on? I'm concerned that two OSDS for some of 
the PGS were adjacent and if that placed them on the same host, it 
would be contrary to your rules and something deeper is wrong.


Did you format the disks that were taken out of the cluster? Can you 
mount the partitions and see the files and directories? If so, you can 
probably recover the data using the tools from the recovery/dev tools.


You may be able to force create the missing PGS using ceph 
force-create <pg.id>. This may or may not work, I don't 
remember.


If you just don't care about losing data, you can delete the pool and 
create a new one. This should work for sure, but losses any data that 
you might have still had. If this pool was full of RBD, then there is 
a high possibility that all of your RBD images had chunks in the 
missing PGs. If you choose not to try to restore the PGS using the 
tools,  I'd be inclined to delete the pool and restore from back up as 
to not be surprised by data corruption in the images. Neither option 
is ideal or quick.


Robert LeBlanc

Sent from a mobile device please excuse any typos.

On Apr 23, 2015 6:42 PM, "FaHui Lin"  wrote:


Hi, thank you for your response.

Well, I've not only taken out but also totally removed the both
OSDs (by "ceph osd rm" and delete everything in
/var/lib/ceph/osd/) of that pg (and similar to all
other stale pgs.)

The main problem I have is those stale pgs (miss all OSDs I've
removed) not merely make ceph health warning, but other machine
cannot mount the ceph rbd as well.

Here's the full crush map.  The OSDs I removed were osd.5~19.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 500

# devices
device 0 osd.0
device 1 device1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host XX-ceph01 {
id -2   # do not change unnecessarily
# weight 160.040
alg straw
hash 0  # rjenkins1
item osd.0 weight 40.010
item osd.2 weight 40.010
item osd.3 weight 40.010
item osd.4 weight 40.010
}
host XX-ceph02 {
id -3   # do not change unnecessarily
# weight 320.160
alg straw
hash 0  # rjenkins1
item osd.20 weight 40.020
item osd.21 weight 40.020
item osd.22 weight 40.020
item osd.23 weight 40.020
item osd.24 weight 40.020
item osd.25 weight 40.020
item osd.26 weight 40.020
item osd.27 weight 40.020
}
root default {
id -1   # do not change unnecessarily
# weight 480.200
alg straw
hash 0  # rjenkins1
item XX-ceph01 weight 160.040
item XX-ceph02 weight 320.160
}

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
t