[ceph-users] A few questions and remarks about cephx

2015-09-06 Thread Marin Bernard
Hi,

I've just setup Ceph Hammer (latest version) on a single node (1 MON, 1
MDS, 4 OSDs) for testing purposes. I used ceph-deploy. I only
configured CephFS as I don't use RBD. My pool config is as follows:

$ sudo ceph df
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED
    7428G  7258G  169G      2.29
POOLS:
    NAME             ID  USED    %USED  MAX AVAIL  OBJECTS
    cephfs_data      1   168G    2.26   7209G      78691
    cephfs_metadata  2   41301k  0      7209G      2525

Cluster is sane:

$ sudo ceph status
cluster 72aba9bb-20db-4f62-8d03-0a8a1019effa
 health HEALTH_OK
 monmap e1: 1 mons at {nice-srv-cosd-00=10.16.1.161:6789/0}
election epoch 1, quorum 0 nice-srv-cosd-00
 mdsmap e5: 1/1/1 up {0=nice-srv-cosd-00=up:active}
 osdmap e71: 4 osds: 4 up, 4 in
  pgmap v3723: 240 pgs, 2 pools, 167 GB data, 80969 objects
168 GB used, 7259 GB / 7428 GB avail
 240 active+clean
  client io 59391 kB/s wr, 29 op/s

CephFS is mounted on a client node, which uses a dedicated cephx key
'client.mynode'. I've had a hard time trying to figure out which cephx
capabilities were required to give the node RW access to CephFS. I
found documentation covering cephx capabilities for RBD, but not for
CephFS. Did I miss something? As of now, the 'client.mynode' key has
the following capabilities, which seem sufficient:

$ sudo ceph auth get client.mynode
exported keyring for client.mynode
[client.mynode]
key = myBeautifulKey
caps mds = "allow r"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_metadata, allow rw
pool=cephfs_data"


Here are a few questions and remarks I made for myself when dealing
with cephx:

1. Are mds caps needed for CephFS clients? If so, do they need r or rw
access? Is it documented somewhere?


2. CephFS requires the clients to have rw access to multiple pools
(data + metadata). I couldn't find the correct syntax to use with 'ceph
auth caps' anywhere but in the ML archive (
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17058.html).
I suggest adding some documentation for it on the main website. Or is
it already there?
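
For reference, the invocation that produced the capabilities shown above
(each pool gets its own 'allow' clause inside the osd cap string, separated
by a comma) was along these lines:

$ sudo ceph auth caps client.mynode mon 'allow r' mds 'allow r' \
  osd 'allow rw pool=cephfs_metadata, allow rw pool=cephfs_data'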


3. I found 'ceph auth caps' syntax validation rather weak, as the
command did not return an error when given incorrect syntax. For
instance, the following command did not raise an error whereas it is
(probably) syntactically incorrect:

$ sudo ceph auth caps client.mynode mon 'allow r' mds 'allow r' osd
'allow rw pool=cephfs_metadata,cephfs_data'

I suppose the comma is treated as part of a single pool name, thus
resulting in:

$ sudo ceph auth get client.mynode
exported keyring for client.mynode
[client.mynode]
key = myBeautifulKey
caps mds = "allow r"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_metadata,cephfs_data"

Is this the expected behaviour? Are special characters allowed in pool names?


4. With the capabilities shown above, the client node was still able to
mount CephFS and to make thousands of reads and writes without any
error. However, since the capabilities were incorrect, it only had rw
access to the 'cephfs_metadata' pool, and no access at all to the
'cephfs_data' pool. As a consequence, files, folders, permissions,
sizes and other metadata were written and retrieved correctly, but the
actual data were silently lost. Shouldn't such a strange situation
raise an error on the client?
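
For what it's worth, per-pool access can probably be verified directly with
the rados CLI (assuming it is installed on the client and can find the
client.mynode keyring), e.g.:

$ rados --id mynode -p cephfs_metadata ls | head   # should work with the caps above
$ rados --id mynode -p cephfs_data ls | head       # should fail with the broken caps

Something along these lines would at least make the asymmetry visible before
mounting the filesystem.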


Thanks!

Marin.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-06 Thread Nick Fisk
Just a quick update: after up'ing the thresholds, not much happened. This is 
probably because the merge threshold is several times lower than the trigger for 
the split. So I have now bumped the merge threshold up to 1000 temporarily to 
hopefully force some DIRs to merge. 

I believe this has started to happen, but it only seems to merge right at the 
bottom of the tree.

E.g.

/var/lib/ceph/osd/ceph-1/current/0.106_head/DIR_6/DIR_0/DIR_1/

All the directories have only 1 directory in them; DIR_1 is the only one in the 
path that has any objects in it. Is this the correct behaviour? Is there any 
impact from having these deeper paths compared to when the objects are just in 
the root directory?

I guess the only real way to get the objects back into the root would be to 
out->drain->in the OSD?
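
For reference, the two settings involved live in the [osd] section of
ceph.conf; the values below are only the ones I'm experimenting with, not
recommendations, and my reading of the split trigger may be off:

[osd]
# subdirectories merge back once they fall below this object count
filestore merge threshold = 40
# a directory splits once it holds roughly
# abs(merge threshold) * split multiple * 16 objects
filestore split multiple = 4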


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Shinobu Kinjo
> Sent: 05 September 2015 01:42
> To: GuangYang 
> Cc: ceph-users ; Nick Fisk 
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
> Very nice.
> You're my hero!
> 
>  Shinobu
> 
> - Original Message -
> From: "GuangYang" 
> To: "Shinobu Kinjo" 
> Cc: "Ben Hines" , "Nick Fisk" , "ceph-
> users" 
> Sent: Saturday, September 5, 2015 9:40:06 AM
> Subject: RE: [ceph-users] Ceph performance, empty vs part full
> 
> 
> > Date: Fri, 4 Sep 2015 20:31:59 -0400
> > From: ski...@redhat.com
> > To: yguan...@outlook.com
> > CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph performance, empty vs part full
> >
> >> IIRC, it only triggers the move (merge or split) when that folder is hit 
> >> by a
> request, so most likely it happens gradually.
> >
> > Do you know what causes this?
> Any request (read/write/setxattr, etc.) hitting objects in that folder.
> > I would like to be more clear on what "gradually" means.
> >
> > Shinobu
> >
> > - Original Message -
> > From: "GuangYang" 
> > To: "Ben Hines" , "Nick Fisk" 
> > Cc: "ceph-users" 
> > Sent: Saturday, September 5, 2015 9:27:31 AM
> > Subject: Re: [ceph-users] Ceph performance, empty vs part full
> >
> > IIRC, it only triggers the move (merge or split) when that folder is hit by 
> > a
> request, so most likely it happens gradually.
> >
> > Another thing that might be helpful (and that we have had good experience
> > with) is doing the folder splitting at pool creation time, so that we avoid
> > the performance impact of runtime splitting (which is high if you have a
> > large cluster). In order to do that (see the sketch after this list):
> >
> > 1. You will need to configure "filestore merge threshold" with a negative
> > value so that it disables merging.
> > 2. When creating the pool, there is a parameter named
> > "expected_num_objects"; by specifying that number, the folders will be
> > split to the right level at pool creation.
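> >
> > A rough sketch of both pieces (option name and command syntax from memory,
> > please double-check against the docs for your release; pool name and
> > numbers are just placeholders):
> >
> > # ceph.conf, [osd] section: a negative value disables merging
> > filestore merge threshold = -10
> >
> > # create the pool pre-split for the expected object count
> > $ ceph osd pool create mypool 2048 2048 replicated \
> >   replicated_ruleset 100000000
> >
> > where the last argument is expected_num_objects.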
> >
> > Hope that helps.
> >
> > Thanks,
> > Guang
> >
> >
> > 
> >> From: bhi...@gmail.com
> >> Date: Fri, 4 Sep 2015 12:05:26 -0700
> >> To: n...@fisk.me.uk
> >> CC: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph performance, empty vs part full
> >>
> >> Yeah, i'm not seeing stuff being moved at all. Perhaps we should file
> >> a ticket to request a way to tell an OSD to rebalance its directory
> >> structure.
> >>
> >> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
> >>> I've just made the same change (4 and 40 for now) on my cluster
> >>> which is a similar size to yours. I didn't see any merging
> >>> happening, although most of the directories I looked at had more
> >>> files in them than the new merge threshold, so I guess this is to be
> >>> expected.
> >>>
> >>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to
> bring things back into order.
> >>>
>  -Original Message-
>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>  Behalf Of Wang, Warren
>  Sent: 04 September 2015 01:21
>  To: Mark Nelson ; Ben Hines
> 
>  Cc: ceph-users 
>  Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
>  I'm about to change it on a big cluster too. It totals around 30
>  million, so I'm a bit nervous about changing it. As far as I
>  understood, it would indeed move them around, if you can get
>  underneath the threshold, but it may be hard to do. These are two more
>  settings that I highly recommend changing on a big prod cluster. I'm in
>  favor of bumping these two up in the defaults.
> 
>  Warren
> 
>  -Original Message-
>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>  Behalf Of Mark Nelson
>  Sent: Thursday, September 03, 2015 6:04 PM
>  To: Ben Hines 
>  Cc: ceph-users 
>  Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
>  Hrm, I think it will follow the merge/split rules if it's out of
>  whack given the new setti

Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-06 Thread Alexandre DERUMIER
>>Thank you, will these packages be provided to debian upstream as well.

Debian manages its own repository, and only provides Firefly.

You can add the ceph.com repository if you want newer releases (Giant, Hammer, ...).
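
For example, on Jessie something along these lines should work (repository
URL and key location from memory, so please double-check; adjust the release
name to the one you want):

$ wget -q -O- https://download.ceph.com/keys/release.asc | sudo apt-key add -
$ echo deb http://download.ceph.com/debian-hammer/ jessie main | \
  sudo tee /etc/apt/sources.list.d/ceph.list
$ sudo apt-get update && sudo apt-get install ceph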


- Original Message -
From: "Jelle de Jong" 
To: "ceph-users" 
Sent: Saturday, 5 September 2015 18:54:53
Subject: Re: [ceph-users] [sepia] debian jessie repository ?

On 02/09/15 16:10, Alfredo Deza wrote: 
> As of yesterday we are now ready to start providing Debian Jessie 
> packages. They will be present by default for the upcoming Ceph release 
> (Infernalis). 
> 
> For other releases (e.g. Firefly, Hammer, Giant) it means that there 
> will be a Jessie package for them for new versions only. 
> 
> Let me know if you have any questions. 

Thank you, will these packages be provided to debian upstream as well? 
We are using ceph 0.80.9-2~bpo8+1 from jessie-backports[1] and preparing 
to take it into production. 

[1] https://packages.debian.org/jessie-backports/ceph 

Kind regards, 

Jelle de Jong 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-06 Thread Yan, Zheng



> On Sep 3, 2015, at 21:19, Janusz Borkowski  
> wrote:
> 
> Hi!
> 
> Actually, it looks like O_APPEND does not work even if the file is kept open 
> read-only (reader + writer). Test:
> 
> in one session
>> less /mnt/ceph/test
> in another session
>> echo "start or end" >> /mnt/ceph/test

I can’t reproduce this on 4.1 kernel.

I wrote a simple fix:
https://github.com/ceph/ceph-client/commit/53c2bc09db6119058170c7dd486788c9aafbfe8b
It gets the inode size before each write. This fix is still racy for the
multiple-writer case. If you want strict append behaviour, please wrap each
write with a file lock.
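
For example, from the shell, a minimal sketch (using flock(1) on the test
file from this thread, and assuming file locking is enabled on your mount)
would be:

$ flock /mnt/ceph/test -c 'echo "start or end" >> /mnt/ceph/test'

With every writer doing its appends under the same lock, the size lookup and
the write become atomic with respect to each other.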

Regards
Yan, Zheng

> 
> "start or end" is written to the start of the file.
> J.
> 
> On 02.09.2015 11:50, Yan, Zheng wrote:
>>> On Sep 2, 2015, at 17:11, Gregory Farnum  wrote:
>>> 
>>> Whoops, forgot to add Zheng.
>>> 
>>> On Wed, Sep 2, 2015 at 10:11 AM, Gregory Farnum  wrote:
 On Wed, Sep 2, 2015 at 10:00 AM, Janusz Borkowski
  wrote:
> Hi!
> 
> I mount cephfs using kernel client (3.10.0-229.11.1.el7.x86_64).
> 
> The effect is the same when doing "echo >>" from another machine and from 
> a
> machine keeping the file open.
> 
> The file is opened with open( ..,
> O_WRONLY|O_LARGEFILE|O_APPEND|O_BINARY|O_CREAT)
> 
> Shell ">>" is implemented as (from strace bash -c "echo '7789' >>
> /mnt/ceph/test):
> 
>   open("/mnt/ceph/test", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
> 
> The test file had ~500KB size.
> 
> Each subsequent "echo >>" writes to the start of the test file, first 
> "echo"
> overwriting the original contents, next "echos" overwriting bytes written 
> by
> the preceding "echo".
 Hmmm. The userspace (ie, ceph-fuse) implementation of this is a little
 bit racy but ought to work. I'm not as familiar with the kernel code
 but I'm not seeing any special behavior in the Ceph code — Zheng,
 would you expect this to work? It looks like some of the linux
 filesystems have their own O_APPEND handling and some don't, but I
 can't find it in the VFS either.
 -Greg
>> Yes, the kernel client does not handle the case where multiple clients do 
>> append writes to the same file. I will fix it soon.
>> 
>> Regards
>> Yan, Zheng
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-06 Thread Christian Balzer
On Sat, 5 Sep 2015 07:13:29 -0300 German Anders wrote:

> Hi Christian,
> 
> OK, so you would say that it's better to rearrange the nodes so I don't
> mix the HDD and SSD disks, right? And create high-perf nodes with SSDs and
> others with HDDs; that's fine since it's a new deployment.
>
It is what I would do, yes. 
However, if you're limited to 7 nodes initially, specialized/optimized nodes
might result in pretty small "subclusters" and thus relatively large
failure domains. 

If, for example, this cluster consisted of 2 SSD and 5 HDD nodes,
losing 1 of the SSD nodes would roughly halve your read speed from that
pool (while, amusingly enough, improving your write speed ^o^).
This is assuming a replication of 2 for SSD pools, which with DC SSDs is a
pretty safe choice.

Also, dense SSD nodes will be able to saturate your network easily; for
example 3-4 of the DC S3xxx SSDs will exceed the bandwidth of your links.
This is of course only an issue if you're actually expecting huge amounts
of reads/writes, as opposed to having lots of small transactions that depend
on low latency.

>Also the nodes have different types of CPU and RAM: 4 have more CPU and more
> memory (384 GB) and the other 3 have less CPU and 128 GB of RAM, so maybe I
> can put the SSDs in the nodes with more CPU and leave the HDDs for the other
> nodes. 

I take it from this that you already have those machines?
Which number and models of CPUs exactly?

What you want is as MUCH CPU power for any SSD node as possible, while the
HDD nodes will benefit mostly from more RAM (page cache).

> The network is going to be InfiniBand FDR at 56 Gb/s on all the
> nodes, for the public network and for the cluster network.
>
Is this 1 interface for the public and 1 for the cluster network?
Note that with IPoIB (with Accelio not being ready yet) I'm seeing at most
1.5GByte/s with QDR (40Gb/s).

If you were to start with a clean slate, I'd go with something like this
to achieve the storage capacity you outlined:

* 1-2 Quad node chassis like this with 4-6 SSD OSDs per node and a 2nd IB
HCA, or a similar product w/o onboard IB and a 2 port IB HCA:
http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTFR.cfm
That will give you 4-8 high performance SSD nodes in 2-4U.

* 5 HDD storage nodes, with 8-10 HDDs and 2-4 journal SSDs like this:
http://www.supermicro.com.tw/products/system/2U/5028/SSG-5028R-E1CR12L.cfm
(4 100GB DC S3700 will perform better than 2 200GB ones and give you
smaller failure domains at about the same price).
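
As a sketch of the CRUSH side (the "ssd" root and the rule name below are
assumptions, to be adapted to whatever hierarchy you actually build), one
rule per media type in the decompiled crushmap is usually enough:

rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

and then point the fast pools at it with
"ceph osd pool set <pool> crush_ruleset 1".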

Christian

>Any other suggestion/comment?
> 
> Thanks a lot!
> 
> Best regards
> 
> German
> 
> 
> On Saturday, September 5, 2015, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Fri, 4 Sep 2015 12:30:12 -0300 German Anders wrote:
> >
> > > Hi cephers,
> > >
> > >I've the following scheme:
> > >
> > > 7x OSD servers with:
> > >
> > Is this a new cluster, total initial deployment?
> >
> > What else are these nodes made of, CPU/RAM/network?
> > While uniform nodes have some appeal (interchangeability, one node down
> > does impact the cluster uniformly) they tend to be compromise
> > solutions. I personally would go with optimized HDD and SSD nodes.
> >
> > > 4x 800GB SSD Intel DC S3510 (OSD-SSD)
> > Only 0.3DWPD, 450TB total in 5 years.
> > If you can correctly predict your write volume and it is below that per
> > SSD, fine. I'd use 3610s, with internal journals.
> >
> > > 3x 120GB SSD Intel DC S3500 (Journals)
> > In this case even more so the S3500 is a bad choice. 3x 135MB/s is
> > nowhere near your likely network speed of 10Gb/s.
> >
> > You will get vastly superior performance and endurance with two 200GB S3610
> > (2x 230MB/s) or S3700 (2x 365MB/s) drives.
> >
> > Why the uneven number of journals SSDs?
> > You want uniform utilization, wear. 2 journal SSDs for 6 HDDs would be
> > a good ratio.
> >
> > > 5x 3TB SAS disks (OSD-SAS)
> > >
> > See above, even numbers make a lot more sense.
> >
> > >
> > > The OSD servers are located on two separate Racks with two power
> > > circuits each.
> > >
> > >I would like to know what is the best way to implement this.. use
> > > the 4x 800GB SSD like a SSD-pool, or used them us a Cache pool? or
> > > any other suggestion? Also any advice for the crush design?
> > >
> > Nick touched on that already, for right now SSD pools would be
> > definitely better.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-06 Thread Brad Hubbard
- Original Message -
> From: "Fangzhe Chang (Fangzhe)" 
> To: ceph-users@lists.ceph.com
> Sent: Saturday, 5 September, 2015 6:26:16 AM
> Subject: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> 
> 
> Hi,
> 
> I’m trying to add a second monitor using ‘ceph-deploy mon new <hostname>’. However, the log file shows the following error:
> 
> 2015-09-04 16:13:54.863479 7f4cbc3f7700 0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 
> 2015-09-04 16:13:54.863491 7f4cbc3f7700 0 -- :6789/0 >>
> :6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0 cs=0 l=0
> c=0x3f29600).failed verifying authorize reply

A couple of things to look at are verifying all your clocks are in sync (ntp
helps here) and making sure you are running ceph-deploy in the directory you
used to create the cluster.
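
For example (just a sketch of what I usually look at):

# clock sync, on each monitor host
$ ntpq -p
$ ceph health detail | grep -i clock

# and run ceph-deploy from the directory that holds the original
# ceph.conf and keyrings created at cluster creation time
$ ls ceph.conf ceph.mon.keyring ceph.bootstrap-*.keyring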

> 
> 
> 
> Does anyone know how to resolve this?
> 
> Thanks
> 
> 
> 
> Fangzhe
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Test

2015-09-06 Thread Wukongming
Test
-
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rgw potential security issue

2015-09-06 Thread Xusangdi
Hi Cephers,

Recently, while doing some tests of RGW functions, I found that the swift key of a 
subuser is kept after removing the subuser. As a result, this subuser/swift-key 
pair can still pass the authentication system and get an auth token (without any 
permissions, though). Moreover, if we create a subuser with the same name later, 
the old swift key becomes valid again. I know we can actually delete the key by 
explicitly specifying '--purge-keys'; I'm just curious why that is not the default.
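
For reference, the sequence I was testing looked roughly like this (user and
subuser names are made up):

# removes the subuser but leaves its swift key behind (current default)
$ radosgw-admin subuser rm --uid=testuser --subuser=testuser:swift

# removes the swift key as well
$ radosgw-admin subuser rm --uid=testuser --subuser=testuser:swift --purge-keys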

Any thought or comment?

p.s.  You may find more information on ceph tracker: 
http://tracker.ceph.com/issues/12890


Best Regards,
Sangdi Xu
-



This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph monitor ip address issue

2015-09-06 Thread Willi Fehler

Hello,

I'm trying to setup my first Ceph Cluster on Hammer.

[root@linsrv002 ~]# ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

[root@linsrv002 ~]# ceph -s
cluster 7a8cc185-d7f1-4dd5-9fe6-42cfd5d3a5b7
 health HEALTH_OK
 monmap e1: 3 mons at {linsrv001=10.10.10.1:6789/0,linsrv002=10.10.10.2:6789/0,linsrv003=10.10.10.3:6789/0}
 election epoch 256, quorum 0,1,2 linsrv001,linsrv002,linsrv003
 mdsmap e60: 1/1/1 up {0=linsrv001=up:active}, 2 up:standby
 osdmap e622: 9 osds: 9 up, 9 in
  pgmap v1216: 384 pgs, 3 pools, 2048 MB data, 532 objects
6571 MB used, 398 GB / 404 GB avail
 384 active+clean

My issue is that I have two networks, a public network (192.168.0.0/24) and 
a cluster network (10.10.10.0/24), and my monitors should listen on 
192.168.0.0/24. Later I want to use CephFS over the public network.


[root@linsrv002 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 7a8cc185-d7f1-4dd5-9fe6-42cfd5d3a5b7
mon_initial_members = linsrv001, linsrv002, linsrv003
mon_host = 10.10.10.1,10.10.10.2,10.10.10.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
mon_clock_drift_allowed = 1
public_network = 192.168.0.0/24
cluster_network = 10.10.10.0/24

[root@linsrv002 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

10.10.10.1  linsrv001
10.10.10.2  linsrv002
10.10.10.3  linsrv003

I've deployed my first cluster with ceph-deploy. What should I do to 
have the monitors listen on port 6789 on the public network?
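
Would changing mon_host (and the monitor addresses) to the public network and
recreating the monitors be enough? E.g. something like this in ceph.conf (the
192.168.0.x addresses below are only placeholders for whatever the hosts
actually have on that network):

mon_host = 192.168.0.1,192.168.0.2,192.168.0.3
public_network = 192.168.0.0/24
cluster_network = 10.10.10.0/24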


Regards - Willi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com