Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread David
Thanks Wido, those are good pointers indeed :)
So we just have to make sure the backend storage (SSD/NVMe journals) and the
controllers won't be saturated, and then go with as many RBDs per VM as possible.

Kind Regards,
David Majchrzak

On 16 Jan 2016, at 22:26, Wido den Hollander wrote:

> On 01/16/2016 07:06 PM, David wrote:
>> Hi!
>> 
>> We're planning our third Ceph cluster and have been trying to figure out how to
>> maximize IOPS on this one.
>> 
>> Our needs:
>> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
>> servers)
>> * Pool for storage of many small files, rbd (probably dovecot maildir
>> and dovecot index etc)
>> 
> 
> Not completely NVMe related, but in this case, make sure you use
> multiple disks.
> 
> For MySQL for example:
> 
> - Root disk for OS
> - Disk for /var/lib/mysql (data)
> - Disk for /var/log/mysql (binary log)
> - Maybe even a InnoDB logfile disk
> 
> With RBD you gain more performance by sending I/O into the cluster in
> parallel. So whenever you can, do so!
> 
> Regarding small files, it might be interesting to play with the stripe
> count and stripe size there. By default these are 1 and 4MB, but maybe
> 16 and 256k work better here.
> 
> With Dovecot as well, use a different RBD disk for the indexes and a
> different one for the Maildir itself.
> 
> Ceph excels at parallel performance. That is what you want to aim for.
> 
>> So I’ve been reading up on:
>> 
>> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
>> 
>> and ceph-users from october 2015:
>> 
>> http://www.spinics.net/lists/ceph-users/msg22494.html
>> 
>> We’re planning something like 5 OSD servers, with:
>> 
>> * 4x 1.2TB Intel S3510
>> * 8x 4TB HDD
>> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
>> one for HDD pool journal)
>> * 2x 80GB Intel S3510 raid1 for system
>> * 256GB RAM
>> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>> 
>> This cluster will probably run Hammer LTS unless there are huge
>> improvements in Infernalis when dealing with 4K IOPS.
>> 
>> The first link above hints at awesome performance. The second one from
>> the list not so much yet.. 
>> 
>> Is anyone running Hammer or Infernalis with a setup like this?
>> Is it a sane setup?
>> Will we become CPU constrained or can we just throw more RAM on it? :D
>> 
>> Kind Regards,
>> David Majchrzak
>> 
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
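As a rough illustration of Wido's suggestions above (pool name, image names and
sizes are made up for this sketch; fancy striping needs --image-format 2, and
older rbd releases expect --stripe-unit in bytes):

# one image per MySQL role so the I/O streams hit the cluster in parallel
rbd create ssd/mysql-data --size 204800 --image-format 2
rbd create ssd/mysql-binlog --size 51200 --image-format 2
rbd create ssd/innodb-log --size 20480 --image-format 2

# maildir image striped 16 x 256K (stripe width equals the default 4M object size)
rbd create ssd/dovecot-maildir --size 1048576 --image-format 2 \
    --stripe-unit 262144 --stripe-count 16

Each image then gets its own request queue in the guest, which is where the
extra parallelism comes from.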


[ceph-users] CephFS

2016-01-17 Thread James Gallagher
Hi,

I'm looking to implement CephFS on my Firefly release (v0.80) cluster, with
XFS as the underlying file system, but so far I'm having some difficulties. After
following the Ceph quick start guide and creating a storage cluster, I have the
following topology

admin node - mds/mon
   osd1
   osd2

Ceph health is OK, and ceph -s shows:

monmap e1: 1 mons at {node1=192.168.43.129:6789/0}, election epoch 2,
quorum 0 node1
mdsmap e6: 1/1/1 up {0=node1=up:active}
osdmap e10: 2 osds: 2 up, 2 in
active+clean, and so on

However, unfortunately when I use the guide here:

http://docs.ceph.com/docs/master/cephfs/kernel/

and try the command

sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs -o
name=admin-node,secretfile=admin.secret

where admin-node is the hostname of the admin node, and admin.secret is the
string taken from ceph.client.admin.keyring without the unnecessary bits.

I then get:
mount: wrong fs type, bad option, bad superblock on 192.168.43.129:6789,
   missing codepage or helper program, or other error
   (for several filesystems, e.g. nfs and cifs, you might
   need a /sbin/mount.<type> helper program)

This leads me to believe that there is a problem with XFS, but this is
supported with this version of Ceph so I don't really know anymore.

When I try the command

sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs -o
name=admin-node,secret={secretkey}

I get libceph: auth method x error -1
mount: permission denied

and when I try
sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs

I get
no secret set ...
error -22 on auth protocol 2 init
then the whole mount: wrong fs type, bad option jargon again


Any ideas?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
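For what it's worth, a hedged sketch of the mount that usually works here:
name= is the CephX user (typically admin), not the hostname, and secretfile=
needs the mount.ceph helper from ceph-common plus a file containing only the
base64 key (the paths below are assumptions):

# the kernel client needs the ceph module; the secretfile option needs
# /sbin/mount.ceph from the ceph-common package
sudo modprobe ceph
sudo sh -c 'ceph auth get-key client.admin > /etc/ceph/admin.secret'
sudo chmod 600 /etc/ceph/admin.secret
sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret

The "need a /sbin/mount.<type> helper program" message is often what you see
when mount.ceph or the ceph kernel module is missing, and "auth method 'x'
error -1" suggests a user/key problem rather than anything to do with XFS.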


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread Tyler Bishop
The changes you are looking for are coming from SanDisk in the upcoming Ceph
"Jewel" release.

Based on benchmarks and testing, SanDisk has contributed heavily on the
tuning side and is promising 90%+ of a drive's native IOPS in the cluster.

The biggest changes will come from the memory allocation with writes. Latency
is going to be a lot lower.


- Original Message -
From: "David" 
To: "Wido den Hollander" 
Cc: ceph-users@lists.ceph.com
Sent: Sunday, January 17, 2016 6:49:25 AM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread David
That is indeed great news! :)
Thanks for the heads up.

Kind Regards,
David Majchrzak


On 17 Jan 2016, at 21:34, Tyler Bishop wrote:

> The changes you are looking for are coming from SanDisk in the upcoming Ceph
> "Jewel" release.
> 
> Based on benchmarks and testing, SanDisk has contributed heavily on the
> tuning side and is promising 90%+ of a drive's native IOPS in the cluster.
> 
> The biggest changes will come from the memory allocation with writes.
> Latency is going to be a lot lower.
> 
> 
> - Original Message -
> From: "David" 
> To: "Wido den Hollander" 
> Cc: ceph-users@lists.ceph.com
> Sent: Sunday, January 17, 2016 6:49:25 AM
> Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cache pool redundancy requirements.

2016-01-17 Thread Tyler Bishop
Based on Sebastien's design, I had some thoughts:
http://www.sebastien-han.fr/images/ceph-cache-pool-compute-design.png

Hypervisors are, for obvious reasons, more susceptible to crashes and reboots for
security updates. Since Ceph uses a standard pool for the cache tier, this
creates a requirement for placement group stability, i.e. we cannot use a pool
with only 1 PG replica. The ideal configuration would be to utilize a
single-replica SSD cache pool as READ ONLY, with all writes sent to the
base tier's SSD journals; this way you're getting quick acks and fast reads
without losing any flash capacity to redundancy.

Has anyone tested a failure with a read-only cache pool that utilizes a single
replica? Does Ceph simply fetch the data and place it into another PG? The cache
pool should be able to sustain drive failures with 1 replica because it's not
needed for consistency.

Interesting topic here.. curious if anyone has tried this. 

Our current architecture utilizes 48 hosts with 2x 1TB SSDs each as a 2-replica
SSD pool. We have 4 hosts with 52x 6TB disks for a capacity pool. We would like
to run the base tier on the spindles with the SSDs as a 100% utilized cache tier
for busy pools.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
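For reference, a minimal sketch of the layout being described, with placeholder
pool names (whether a single replica is actually safe for a read-only tier is
exactly the open question here):

# put an SSD pool in front of the spinning base pool as a read-only cache tier
ceph osd tier add capacity ssd-cache
ceph osd tier cache-mode ssd-cache readonly
ceph osd tier set-overlay capacity ssd-cache

# single copy in the cache; writes bypass it and land on the replicated base tier
ceph osd pool set ssd-cache size 1
ceph osd pool set ssd-cache min_size 1
ceph osd pool set ssd-cache hit_set_type bloom

In readonly mode the cache only ever holds re-fetchable copies of objects that
are already replicated in the base tier, which is the property the whole idea
relies on.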


Re: [ceph-users] Ceph Cache pool redundancy requirements.

2016-01-17 Thread Tyler Bishop
Adding to this thought: even if you are using a single replica for the cache
pool, will Ceph scrub the cached blocks against the base tier? What if you have
corruption in your cache?


From: "Tyler Bishop"  
To: ceph-users@lists.ceph.com 
Cc: "Sebastien han"  
Sent: Sunday, January 17, 2016 3:47:13 PM 
Subject: Ceph Cache pool redundancy requirements. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition

2016-01-17 Thread Stuart Longland
On 16/01/16 05:39, Robert LeBlanc wrote:
> If you are not booting from the GPT disk, you don't need the EFI
> partition (or any special boot partition). The required backup GPT table is
> usually put at the end, where there is typically some free space anyway.
> It has been a long time since I've converted from MBR to GPT, but it
> didn't require any resizing that I remember. I'd test it in a VM or
> similar to make sure you understand the process. You will also have to
> manually add the Ceph Journal UUID to the partition after the
> conversion for it to all work automatically.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Yeah, I am actually booting from this disk.  The machine has 3 drives:
two 3TB HDDs and one 60GB SSD.

The SSD is split up into thirds, with one third being the OS (and /boot)
and the rest are two journal partitions for the two OSD disks.

The two OSDs are on the first two SATA slots (these are the fastest ones
on this particular motherboard) with the boot disk on the third slot;
thus seen as /dev/sdc.

There's boot firmware to consider as well, pretty sure it's a switch in
the BIOS to switch between UEFI and legacy mode, and UEFI is required
for booting GPT.
-- 
 _ ___ Stuart Longland - Systems Engineer
\  /|_) |   T: +61 7 3535 9619
 \/ | \ | 38b Douglas StreetF: +61 7 3535 9699
   SYSTEMSMilton QLD 4064   http://www.vrt.com.au
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
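In case it helps, a hedged sketch of the conversion Robert describes, assuming
the 60GB SSD is /dev/sdc with the OS in partition 1 and the two journals in
partitions 2 and 3 (the type code is the standard Ceph journal partition GUID
that the udev rules key on):

# convert the MBR label to GPT in place, then tag the journal partitions
sgdisk --mbrtogpt /dev/sdc
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
sgdisk --typecode=3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
partprobe /dev/sdc   # or reboot so the kernel re-reads the partition table

Since this is also the boot disk, the bootloader side (a BIOS boot partition for
legacy boot, or a switch to UEFI) still has to be sorted out separately, which
is the catch being discussed above.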


[ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-17 Thread deeepdish
Hi Everyone,

Looking for a double-check of my logic and crush map.

Overview:

- The osdgroup bucket type defines a failure domain within a host of 5 OSDs + 1
SSD. Therefore 5 OSDs (all utilizing the same journal) constitute an osdgroup
bucket. Each host has 4 osdgroups.
- 6 monitors
- Two-node cluster
- Each node:
  - 20 OSDs
  - 4 SSDs
  - 4 osdgroups

Desired Crush Rule outcome:
- Assuming a pool with min_size=2 and size=4, each node would contain a
redundant copy of each object. Should either host fail, access to data
would be uninterrupted.

Current Crush Rule outcome:
- There are 4 copies of each object; however, I don't believe each node has a
redundant copy of each object. When a node fails, data is NOT accessible until
Ceph rebuilds itself / the node becomes accessible again.

I suspect my crush map is not right, and remedying it may take some time and
cause the cluster to be unresponsive / unavailable. Is there a way / method to
apply substantial crush changes gradually to a cluster?

Thanks for your help.
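For reference, a hedged sketch of a rule aimed at the "two copies on each host"
goal, assuming the map below (which is cut off at the end) ends in a root
bucket, called default here, that contains both hosts:

rule host-redundant {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type host
        step chooseleaf firstn 2 type osdgroup
        step emit
}

With size=4 this selects both hosts and then two osdgroups (one OSD under each)
per host, so losing a node still leaves two copies. As for rolling out a large
CRUSH change gradually, the usual throttle is to lower osd_max_backfills and
osd_recovery_max_active before injecting the new map.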


Current crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39

# types
type 0 osd
type 1 osdgroup
type 2 host
type 3 rack
type 4 site
type 5 root

# buckets
osdgroup b02s08-osdgroupA {
id -81  # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.620
item osd.1 weight 3.620
item osd.2 weight 3.620
item osd.3 weight 3.620
item osd.4 weight 3.620
}
osdgroup b02s08-osdgroupB {
id -82  # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.5 weight 3.620
item osd.6 weight 3.620
item osd.7 weight 3.620
item osd.8 weight 3.620
item osd.9 weight 3.620
}
osdgroup b02s08-osdgroupC {
id -83  # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.10 weight 3.620
item osd.11 weight 3.620
item osd.12 weight 3.620
item osd.13 weight 3.620
item osd.14 weight 5.440
}
osdgroup b02s08-osdgroupD {
id -84  # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.15 weight 3.620
item osd.16 weight 3.620
item osd.17 weight 3.620
item osd.18 weight 3.620
item osd.19 weight 5.440
}
host b02s08 {
id -80  # do not change unnecessarily
# weight 76.040
alg straw
hash 0  # rjenkins1
item b02s08-osdgroupA weight 18.100
item b02s08-osdgroupB weight 18.100
item b02s08-osdgroupC weight 19.920
item b02s08-osdgroupD weight 19.920
}
osdgroup b02s12-osdgroupA {
id -121 # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.20 weight 3.620
item osd.21 weight 3.620
item osd.22 weight 3.620
item osd.23 weight 3.620
item osd.24 weight 3.620
}
osdgroup b02s12-osdgroupB {
id -122 # do not change unnecessarily
# weight 18.100
alg straw
hash 0  # rjenkins1
item osd.25 weight 3.620
item osd.26 weight 3.620
item osd.27 weight 3.620
item osd.28 weight 3.620
item osd.29 weight 3.620
}
osdgroup b02s12-osdgroupC {
id -123 # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.30 weight 3.620
item osd.31 weight 3.620
item osd.32 weight 3.620
item osd.33 weight 3.620
item osd.34 weight 5.440
}
osdgroup b02s12-osdgroupD {
id -124 # do not change unnecessarily
# weight 19.920
alg straw
hash 0  # rjenkins1
item osd.35 weight 3.620
item osd.36 weight 3.620
item osd.37 weight 3.620
item osd.38 weight 3.620
item osd.39 weight 5.440
}
host b02s12 {
id -120 # do not change unnecessarily
# weight 76.040
alg straw
  

[ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Francois Lafont
Hello,

Can someone explain to me the difference between the df and du commands
concerning the data used in my cephfs? And which is the correct value,
958M or 4.2G?

~# du -sh /mnt/cephfs
958M    /mnt/cephfs

~# df -h /mnt/cephfs/
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse        55T  4.2G   55T   1% /mnt/cephfs

My client node is a "classical" Ubuntu Trusty, kernel 3.13 but as you
can see I'm using ceph-fuse. The cluster nodes are "classical" Ubuntu
Trusty nodes too.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread Christian Balzer

Hello,

On Sat, 16 Jan 2016 19:06:07 +0100 David wrote:

> Hi!
> 
> We're planning our third Ceph cluster and have been trying to figure out how to
> maximize IOPS on this one.
> 
> Our needs:
> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
> servers)
> * Pool for storage of many small files, rbd (probably dovecot maildir
> and dovecot index etc)
>
I'm running dovecot for several 100k users on 2-node DRBD clusters, and for
a mail archive server for a few hundred users backed by Ceph/RBD.
The latter works fine (it's not that busy), but I wouldn't consider
replacing the DRBD clusters with Ceph/RBD at this time (higher investment
in storage, 3x vs 2x replication, and lower performance of course).

Depending on your use case you may be just fine of course.

> So I’ve been reading up on:
> 
> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
> 
> and ceph-users from october 2015:
> 
> http://www.spinics.net/lists/ceph-users/msg22494.html
> 
> We’re planning something like 5 OSD servers, with:
> 
> * 4x 1.2TB Intel S3510
I'd be wary of that.
As in, you're spec'ing the best Intel SSDs money can buy below for
journals, but the least write-endurable Intel DC SSDs for OSDs here.
Note that write amplification (beyond Ceph and FS journals) is very much a
thing, especially with small files. 
There's a mail about this by me in the ML archives somewhere:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

Unless you're very sure about this being a read-mostly environment I'd go
with 3610's at least.

> * 8x 4TB HDD
> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
> one for HDD pool journal)
You may be better off (cost- and SPOF-wise) with 2x 200GB S3700s (not 3710s)
for the HDD journals, but then again those won't fit into your case, will
they...
Given the IOPS limits in Ceph as it is, you're unlikely to see much of a
difference if you forgo a journal for the SSDs and use shared journals with
DC S3610 or 3710 OSD SSDs.
Note that as far as pure throughput is concerned (in most operations the
least critical factor) your single journal SSD will limit things to the
speed of 2 (of your 4) storage SSDs.
But then again, your network is probably saturated before that.

> * 2x 80GB Intel S3510 raid1 for system
> * 256GB RAM
Plenty. ^o^

> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
> 
Not sure about Jewel, but SSD OSDs will eat pretty much any and all CPU
cycles you can throw at them.
This also boils down to the question of whether mixed HDD/SSD storage nodes
(with the fun of having to set "osd crush update on start = false") are a
good idea or not, as opposed to nodes that are optimized for their
respective storage hardware (CPU, RAM, network wise).

Regards,

Christian
> This cluster will probably run Hammer LTS unless there are huge
> improvements in Infernalis when dealing with 4K IOPS.
> 
> The first link above hints at awesome performance. The second one from
> the list not so much yet.. 
> 
> Is anyone running Hammer or Infernalis with a setup like this?
> Is it a sane setup?
> Will we become CPU constrained or can we just throw more RAM on it? :D
> 
> Kind Regards,
> David Majchrzak

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
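To put rough numbers on the journal-bandwidth point above (the figures are
approximate published sequential-write specs, not measurements):

4 x ~440 MB/s  (1.2TB S3510)  = ~1760 MB/s of OSD SSD write bandwidth
1 x ~1080 MB/s (400GB P3700)  = the journal sitting in front of them

So streaming writes cap at roughly the bandwidth of two of the four SSDs, and a
single 10GbE link (~1250 MB/s) saturates at about the same point anyway.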


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Francois Lafont
On 18/01/2016 04:19, Francois Lafont wrote:

> ~# du -sh /mnt/cephfs
> 958M  /mnt/cephfs
> 
> ~# df -h /mnt/cephfs/
> Filesystem  Size  Used Avail Use% Mounted on
> ceph-fuse55T  4.2G   55T   1% /mnt/cephfs

Even with the option --apparent-size, the sizes are different (though closer
indeed):

~# du -sh --apparent-size /mnt/cephfs
3.8G    /mnt/cephfs


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Adam Tygart
As I understand it:

4.2G is used by Ceph (all replication, metadata, et al.); it is the sum of
all the space "used" on the OSDs.
958M is the actual space the data in cephfs is using (without replication).
3.8G means you have some sparse files in cephfs.

'ceph df detail' should return something close to 958MB used for your
cephfs "data" pool. "RAW USED" should be close to 4.2GB

--
Adam

On Sun, Jan 17, 2016 at 9:53 PM, Francois Lafont  wrote:
> On 18/01/2016 04:19, Francois Lafont wrote:
>
>> ~# du -sh /mnt/cephfs
>> 958M  /mnt/cephfs
>>
>> ~# df -h /mnt/cephfs/
>> Filesystem  Size  Used Avail Use% Mounted on
>> ceph-fuse55T  4.2G   55T   1% /mnt/cephfs
>
> Even with the option --apparent-size, the sizes are different (though closer
> indeed):
>
> ~# du -sh --apparent-size /mnt/cephfs
> 3.8G/mnt/cephfs
>
>
> --
> François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com