[ceph-users] About IOPS num

2014-08-29 Thread lixue...@chinacloud.com.cn

guys:
There's a Ceph cluster in operation whose nodes are connected with 10Gb cable.
We set fio's bs=4k, and the RBD object size is 4MB.
The client node is connected to the cluster via a 1000Mb cable. The resulting IOPS is
27 when we keep fio's latency under 20ms. 27 * 4MB = 108MB/s, which is nearly
1000Mb/s, so is that reasonable?
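
For what it's worth, the arithmetic can be sanity-checked like this (a rough sketch in shell; it only restates the numbers in the question and assumes every request ends up moving a whole 4MB object across the client link):

# 27 objects/s * 4 MB per object * 8 bits per byte, in Mbit/s
echo "$(( 27 * 4 * 8 )) Mbit/s"    # prints 864 Mbit/s, close to a saturated 1000Mb link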


lixue...@chinacloud.com.cn


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Andrey Korolyov
On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy  wrote:
> Thanks Haomai !
>
> Here is some of the data from my setup.
>
>
>
> --
>
> Set up:
>
> 
>
>
>
> 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) ->
> one OSD. 5 client m/c with 12 core cpu and each running two instances of
> ceph_smalliobench (10 clients total). Network is 10GbE.
>
>
>
> Workload:
>
> -
>
>
>
> Small workload – 20K objects with 4K size and io_size is also 4K RR. The
> intent is to serve the ios from memory so that it can uncover the
> performance problems within single OSD.
>
>
>
> Results from Firefly:
>
> --
>
>
>
> Single client throughput is ~14K iops, but as the number of client increases
> the aggregated throughput is not increasing. 10 clients ~15K iops. ~9-10 cpu
> cores are used.
>
>
>
> Result with latest master:
>
> --
>
>
>
> Single client is ~14K iops, but scaling as number of clients increases. 10
> clients ~107K iops. ~25 cpu cores are used.
>
>
>
> --
>
>
>
>
>
> More realistic workload:
>
> -
>
> Let’s see how it is performing while > 90% of the ios are served from disks
>
> Setup:
>
> ---
>
> 40 cpu core server as a cluster node (single node cluster) with 64 GB RAM. 8
> SSDs -> 8 OSDs. One similar node for monitor and rgw. Another node for
> client running fio/vdbench. 4 rbds are configured with ‘noshare’ option. 40
> GbE network
>
>
>
> Workload:
>
> 
>
>
>
> 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data.  Io_size = 4K RR.
>
>
>
> Results from Firefly:
>
> 
>
>
>
> Aggregated output while 4 rbd clients stressing the cluster in parallel is
> ~20-25K IOPS , cpu cores used ~8-10 cores (may be less can’t remember
> precisely)
>
>
>
> Results from latest master:
>
> 
>
>
>
> Aggregated output while 4 rbd clients stressing the cluster in parallel is
> ~120K IOPS , cpu is 7% idle i.e  ~37-38 cpu cores.
>
>
>
> Hope this helps.
>
>
>

Thanks Roy, the results are very promising!

Just two questions - do the numbers above count HT (logical) cores, or
did you normalize the result to physical cores? And what was the
percentage of I/O time/utilization in this test (if you measured it)?


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Haomai Wang
On Fri, Aug 29, 2014 at 4:03 PM, Andrey Korolyov  wrote:

> On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy 
> wrote:
> > Thanks Haomai !
> >
> > Here is some of the data from my setup.
> >
> >
> >
> >
> --
> >
> > Set up:
> >
> > 
> >
> >
> >
> > 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data)
> ->
> > one OSD. 5 client m/c with 12 core cpu and each running two instances of
> > ceph_smalliobench (10 clients total). Network is 10GbE.
> >
> >
> >
> > Workload:
> >
> > -
> >
> >
> >
> > Small workload – 20K objects with 4K size and io_size is also 4K RR. The
> > intent is to serve the ios from memory so that it can uncover the
> > performance problems within single OSD.
> >
> >
> >
> > Results from Firefly:
> >
> > --
> >
> >
> >
> > Single client throughput is ~14K iops, but as the number of client
> increases
> > the aggregated throughput is not increasing. 10 clients ~15K iops. ~9-10
> cpu
> > cores are used.
> >
> >
> >
> > Result with latest master:
> >
> > --
> >
> >
> >
> > Single client is ~14K iops, but scaling as number of clients increases.
> 10
> > clients ~107K iops. ~25 cpu cores are used.
> >
> >
> >
> >
> --
> >
> >
> >
> >
> >
> > More realistic workload:
> >
> > -
> >
> > Let’s see how it is performing while > 90% of the ios are served from
> disks
> >
> > Setup:
> >
> > ---
> >
> > 40 cpu core server as a cluster node (single node cluster) with 64 GB
> RAM. 8
> > SSDs -> 8 OSDs. One similar node for monitor and rgw. Another node for
> > client running fio/vdbench. 4 rbds are configured with ‘noshare’ option.
> 40
> > GbE network
> >
> >
> >
> > Workload:
> >
> > 
> >
> >
> >
> > 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data.  Io_size = 4K RR.
> >
> >
> >
> > Results from Firefly:
> >
> > 
> >
> >
> >
> > Aggregated output while 4 rbd clients stressing the cluster in parallel
> is
> > ~20-25K IOPS , cpu cores used ~8-10 cores (may be less can’t remember
> > precisely)
>

Good job! I would like to try it out later.


> >
> >
> >
> > Results from latest master:
> >
> > 
> >
> >
> >
> > Aggregated output while 4 rbd clients stressing the cluster in parallel
> is
> > ~120K IOPS , cpu is 7% idle i.e  ~37-38 cpu cores.
> >
> >
> >
> > Hope this helps.
> >
> >
> >
>
> Thanks Roy, the results are very promising!
>
> Just two moments - are numbers from above related to the HT cores or
> you renormalized the result for real ones? And what about percentage
> of I/O time/utilization in this test was (if you measured this ones)?
>



-- 

Best Regards,

Wheat


[ceph-users] RFC: A preliminary Chinese version of Calamari

2014-08-29 Thread Li Wang

Hi,
  We have set up a preliminary Chinese version of Calamari at
https://github.com/xiaoxianxia/calamari-clients-cn. The main work done
so far is translating the English words on the web interface into Chinese;
we did not change the localization infrastructure, and any help
in that direction is appreciated. Any suggestions, tests,
and technical involvement are also welcome, to make it ready to
be merged upstream.

Cheers,
Li Wang


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Sebastien Han
Thanks a lot for the answers, even if we drifted from the main subject a little
bit.
Thanks Somnath for sharing this; when can we expect any code that might
improve _write_ performance?

@Mark thanks for trying this :)
Unfortunately, using nobarrier and another dedicated SSD for the journal (plus
your ceph settings) didn’t bring much; now I can reach 3,5K IOPS.
By any chance, would it be possible for you to test with a single OSD SSD?
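
For reference, the same kind of O_DIRECT/O_SYNC check as the dd test quoted below can also be done with fio against the raw device. This is only a sketch: /dev/sdX is a placeholder, and like the dd run it overwrites data on that device.

# 4k random writes with O_DIRECT and O_SYNC straight to the SSD (destroys data on /dev/sdX)
fio --name=ssd-sync-test --filename=/dev/sdX --ioengine=psync \
    --rw=randwrite --bs=4k --direct=1 --sync=1 --iodepth=1 \
    --numjobs=1 --time_based --runtime=30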

On 28 Aug 2014, at 18:11, Sebastien Han  wrote:

> Hey all,
> 
> It has been a while since the last thread performance related on the ML :p
> I’ve been running some experiment to see how much I can get from an SSD on a 
> Ceph cluster.
> To achieve that I did something pretty simple:
> 
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (i’d like to keep this realistic since in a real 
> deployment i’ll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * 1 replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop but deadline was showing the same results
> * no updatedb running
> 
> About the box:
> 
> * 32GB of RAM
> * 12 cores with HT @ 2,4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn’t help here)
> 
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops 
> with random 4k writes (my fio results)
> As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom 
> guys!).
> 
> O_DIRECT and D_SYNC don’t seem to be a problem for the SSD:
> 
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
> 
> # du -sh rand.file
> 256Mrand.file
> 
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
> 
> See my ceph.conf:
> 
> [global]
>  auth cluster required = cephx
>  auth service required = cephx
>  auth client required = cephx
>  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>  osd pool default pg num = 4096
>  osd pool default pgp num = 4096
>  osd pool default size = 2
>  osd crush chooseleaf type = 0
> 
>   debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
> 
> [mon]
>  mon osd down out interval = 600
>  mon osd min down reporters = 13
>[mon.ceph-01]
>host = ceph-01
>mon addr = 172.20.20.171
>  [mon.ceph-02]
>host = ceph-02
>mon addr = 172.20.20.172
>  [mon.ceph-03]
>host = ceph-03
>mon addr = 172.20.20.173
> 
>debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
> 
> [osd]
>  osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>  osd journal size = 20480
>  cluster_network = 172.20.20.0/24
>  public_network = 172.20.20.0/24
>  osd mon heartbeat interval = 30
>  # Performance tuning
>  filestore merge threshold = 40
>  filestore split multiple = 8
>  osd op threads = 8
>  # Recovery tuning
>  osd recovery max active = 1
>  osd max backfills = 1
>  osd recovery op priority = 1
> 
> 
>debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
> 
> Disabling all debugging gained me 200-300 more IOPS.
> 
> See my fio template:
> 
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_lo
> 
> time_based
> runtime=60
> 
> ioengine=rbd
> clientname=admin
> 

[ceph-users] Difference between "object rm" and "object unlink" ?

2014-08-29 Thread zhu qiang
Hi all,
   From the radosgw-admin command:
  # radosgw-admin object rm --object=my_test_file.txt --bucket=test_buck
  # radosgw-admin object unlink --object=my_test_file.txt --bucket=test_buck
  Both of the above "delete" the object from the bucket, and both give the right
result.
  What is the difference between these two commands? Could someone tell me
the details?
Thanks very much.
 
  



Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Dan Van Der Ster
Hi Sebastien,
Here’s my recipe for max IOPS on a _testing_ instance with SSDs:

  osd op threads = 2
  osd disk threads = 2
  journal max write bytes = 1048576
  journal queue max bytes = 1048576
  journal max write entries = 1
  journal queue max ops = 5
  filestore op threads = 2
  filestore max sync interval = 60
  filestore queue max ops = 5
  filestore queue max bytes = 1048576
  filestore queue committing max bytes = 1048576
  filestore queue committing max ops = 5
  filestore wbthrottle xfs bytes start flusher = 4194304000
  filestore wbthrottle xfs bytes hard limit = 4194304
  filestore wbthrottle xfs ios start flusher = 5
  filestore wbthrottle xfs ios hard limit = 50
  filestore wbthrottle xfs inodes start flusher = 5
  filestore wbthrottle xfs inodes hard limit = 50

Basically the goal there is to ensure no IOs are blocked from entering any 
queue. (And don’t run that in production!)
IIRC I can get up to around 5000 IOPS to a single fio/rbd client. Related to 
the sync interval, I was also playing with vm.dirty_expire_centisecs and 
vm.dirty_writeback_centisecs to disable the background page flushing (disables 
FileStore flushing). That way, the only disk activity becomes the journal 
writes. You can confirm that with iostat.
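
For reference, a sketch of what that looks like on the command line (testing only; the values are illustrative, not recommendations):

# stop the kernel's periodic dirty-page writeback so FileStore data sits in the page cache
sysctl -w vm.dirty_writeback_centisecs=0
sysctl -w vm.dirty_expire_centisecs=360000

# then watch the SSD: with the above in place, essentially all device writes should be journal writes
iostat -xm 1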

Another thing that comes to mind is at some point a single fio/librbd client 
will be the bottleneck. Did you try running two simultaneous fio’s (then adding 
the results)?
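
Something along these lines, for example (a sketch only; the pool/image names and option values are placeholders):

# two independent fio processes against two separate RBD images; add up the reported IOPS
fio --name=job1 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test1 \
    --rw=randwrite --bs=4k --iodepth=32 --time_based --runtime=60 &
fio --name=job2 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test2 \
    --rw=randwrite --bs=4k --iodepth=32 --time_based --runtime=60 &
wait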

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 28 Aug 2014, at 18:11, Sebastien Han  wrote:

> Hey all,
> 
> It has been a while since the last thread performance related on the ML :p
> I’ve been running some experiment to see how much I can get from an SSD on a 
> Ceph cluster.
> To achieve that I did something pretty simple:
> 
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (i’d like to keep this realistic since in a real 
> deployment i’ll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * 1 replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop but deadline was showing the same results
> * no updatedb running
> 
> About the box:
> 
> * 32GB of RAM
> * 12 cores with HT @ 2,4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn’t help here)
> 
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops 
> with random 4k writes (my fio results)
> As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom 
> guys!).
> 
> O_DIRECT and D_SYNC don’t seem to be a problem for the SSD:
> 
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
> 
> # du -sh rand.file
> 256Mrand.file
> 
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
> 
> See my ceph.conf:
> 
> [global]
>  auth cluster required = cephx
>  auth service required = cephx
>  auth client required = cephx
>  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>  osd pool default pg num = 4096
>  osd pool default pgp num = 4096
>  osd pool default size = 2
>  osd crush chooseleaf type = 0
> 
>   debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
> 
> [mon]
>  mon osd down out interval = 600
>  mon osd min down reporters = 13
>[mon.ceph-01]
>host = ceph-01
>mon addr = 172.20.20.171
>  [mon.ceph-02]
>host = ceph-02
>mon addr = 172.20.20.172
>  [mon.ceph-03]
>host = ceph-03
>mon addr = 172.20.20.173
> 
>debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
> 
> [osd]
>  osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>  osd journal size = 20480
>  cluster_network = 172.20.20.0/24
>  pub

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Kasper Dieter
Hi Sébastien,

On Thu, Aug 28, 2014 at 06:11:37PM +0200, Sebastien Han wrote:
> Hey all,
(...)
> We have been able to reproduce this on 3 distinct platforms with some 
> deviations (because of the hardware) but the behaviour is the same.
> Any thoughts will be highly appreciated, only getting 3,2k out of an 29K IOPS 
> SSD is a bit frustrating :).

Yes,
it's the OSD code running through ~20k lines of code on every IO and using ~1000
system calls for 1 real IO.
Please have a look at the attached page, which I presented last Sept at SDC.

=> Ceph (v0.61) used ~1600 µs for a 4k IO, SSD ~60 µs, network ~20-200 µs
(on v0.84 with all debug set to 0/0 it is still ~1100 µs)

=> so, tune or re-write the OSD code ... (I'm serious about this statement)
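
A rough sketch of how to see those per-IO system-call counts for yourself on a test box running a single OSD (strace adds overhead, so only do this on a throwaway node):

# attach to the OSD, follow all threads, and print a per-syscall summary after ~30s of load
timeout -s INT 30 strace -c -f -p "$(pidof ceph-osd)"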


Mit freundlichen Grüßen / Best regards
Dieter 


> 
> Cheers.
>  
> Sébastien Han 
> Cloud Architect 
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72 
> Mail: sebastien@enovance.com 
> Address : 11 bis, rue Roquépine - 75008 Paris
> Web : www.enovance.com - Twitter : @enovance 
> 


> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Ceph-OSD-TAT.pdf
Description: Adobe PDF document


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Mark Nelson

On 08/29/2014 06:10 AM, Dan Van Der Ster wrote:

Hi Sebastien,
Here’s my recipe for max IOPS on a _testing_ instance with SSDs:

   osd op threads = 2


With SSDs, in the past I've seen that increasing the osd op thread count can
help random reads.



   osd disk threads = 2
   journal max write bytes = 1048576
   journal queue max bytes = 1048576
   journal max write entries = 1
   journal queue max ops = 5
   filestore op threads = 2
   filestore max sync interval = 60
   filestore queue max ops = 5
   filestore queue max bytes = 1048576
   filestore queue committing max bytes = 1048576
   filestore queue committing max ops = 5
   filestore wbthrottle xfs bytes start flusher = 4194304000
   filestore wbthrottle xfs bytes hard limit = 4194304
   filestore wbthrottle xfs ios start flusher = 5
   filestore wbthrottle xfs ios hard limit = 50
   filestore wbthrottle xfs inodes start flusher = 5
   filestore wbthrottle xfs inodes hard limit = 50


It's also probably worth trying to disable all in-memory logging.
Unfortunately I don't think we have a global flag to do this, so you
have to do it on a per-subsystem basis, which is annoying.
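
For reference, the debug values take the form "<file level>/<in-memory level>", so 0/0 turns off both the written log and the in-memory gathering. A sketch of doing it at runtime for a few subsystems (the list here is just a sample, not complete):

ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0 --debug_filestore 0/0 --debug_journal 0/0'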




Basically the goal there is to ensure no IOs are blocked from entering any 
queue. (And don’t run that in production!)
IIRC I can get up to around 5000 IOPS to a single fio/rbd client. Related to 
the sync interval, I was also playing with vm.dirty_expire_centisecs and 
vm.dirty_writeback_centisecs to disable the background page flushing (disables 
FileStore flushing). That way, the only disk activity becomes the journal 
writes. You can confirm that with iostat.

Another thing that comes to mind is at some point a single fio/librbd client 
will be the bottleneck. Did you try running two simultaneous fio’s (then adding 
the results)?

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 28 Aug 2014, at 18:11, Sebastien Han  wrote:


Hey all,

It has been a while since the last thread performance related on the ML :p
I’ve been running some experiment to see how much I can get from an SSD on a 
Ceph cluster.
To achieve that I did something pretty simple:

* Debian wheezy 7.6
* kernel from debian 3.14-0.bpo.2-amd64
* 1 cluster, 3 mons (i’d like to keep this realistic since in a real deployment 
i’ll use 3)
* 1 OSD backed by an SSD (journal and osd data on the same device)
* 1 replica count of 1
* partitions are perfectly aligned
* io scheduler is set to noop but deadline was showing the same results
* no updatedb running

About the box:

* 32GB of RAM
* 12 cores with HT @ 2,4 GHz
* WB cache is enabled on the controller
* 10Gbps network (doesn’t help here)

The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops 
with random 4k writes (my fio results)
As a benchmark tool I used fio with the rbd engine (thanks deutsche telekom 
guys!).

O_DIRECT and D_SYNC don’t seem to be a problem for the SSD:

# dd if=/dev/urandom of=rand.file bs=4k count=65536
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s

# du -sh rand.file
256Mrand.file

# dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s

See my ceph.conf:

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
  osd pool default pg num = 4096
  osd pool default pgp num = 4096
  osd pool default size = 2
  osd crush chooseleaf type = 0

   debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

[mon]
  mon osd down out interval = 600
  mon osd min down reporters = 13
[mon.ceph-01]
host = ceph-01
mon addr = 172.20.20.171
  [mon.ceph-02]
host = ceph-02
mon addr = 172.20.20.172
  [mon.ceph-03]
host = ceph-03
mon addr = 172.20.20.173

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

[osd]

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Sebastien Han
@Dan: thanks for sharing your config; with all your flags I don’t seem to get
more than 3,4K IOPS, and they even seem to slow me down :( This is really weird.
Yes, I already tried running two simultaneous processes and got only half of 3,4K for
each of them.

@Kasper: thanks for these results, I believe some improvement could be made in 
the code as well :).

FYI I just tried on Ubuntu 12.04 and it looks a bit better because I’m getting 
iops=3783.

On 29 Aug 2014, at 13:10, Dan Van Der Ster  wrote:

> vm.dirty_expire_centisecs


Cheers.
 
Sébastien Han 
Cloud Architect 

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien@enovance.com 
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance 





Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Mark Nelson
Excellent, I've been meaning to check into how the TCP transport is 
going.  Are you using a hybrid threadpool/epoll approach?  That I 
suspect would be very effective at reducing context switching, 
especially compared to what we do now.


Mark

On 08/28/2014 10:40 PM, Matt W. Benjamin wrote:

Hi,

There's also an early-stage TCP transport implementation for Accelio, also 
EPOLL-based.  (We haven't attempted to run Ceph protocols over it yet, to my 
knowledge, but it should be straightforward.)

Regards,

Matt

- "Haomai Wang"  wrote:


Hi Roy,


As for the messenger level, I have some very early work on it
(https://github.com/yuyuyu101/ceph/tree/msg-event); it contains a
new messenger implementation which supports different event mechanisms.
It looks like it will take at least one more week to make it work.








Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Matt W. Benjamin
Hi Mark,

Yeah.  The application defines portals which are active threaded, then the 
transport layer is servicing the portals with EPOLL.

Matt

- "Mark Nelson"  wrote:

> Excellent, I've been meaning to check into how the TCP transport is 
> going.  Are you using a hybrid threadpool/epoll approach?  That I 
> suspect would be very effective at reducing context switching, 
> especially compared to what we do now.
> 
> Mark
> 
> On 08/28/2014 10:40 PM, Matt W. Benjamin wrote:
> > Hi,
> >
> > There's also an early-stage TCP transport implementation for
> Accelio, also EPOLL-based.  (We haven't attempted to run Ceph
> protocols over it yet, to my knowledge, but it should be
> straightforward.)
> >
> > Regards,
> >
> > Matt
> >
> > - "Haomai Wang"  wrote:
> >
> >> Hi Roy,
> >>
> >>
> >> As for messenger level, I have some very early works on
> >> it(https://github.com/yuyuyu101/ceph/tree/msg-event), it contains
> a
> >> new messenger implementation which support different event
> mechanism.
> >> It looks like at least one more week to make it work.
> >>
> >
> >
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Somnath Roy
Results are with HT enabled. We are yet to measure the performance with HT 
disabled.
No, I didn't measure I/O time/utilization.

-Original Message-
From: Andrey Korolyov [mailto:and...@xdel.ru]
Sent: Friday, August 29, 2014 1:03 AM
To: Somnath Roy
Cc: Haomai Wang; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS

On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy  wrote:
> Thanks Haomai !
>
> Here is some of the data from my setup.
>
>
>
> --
> --
> --
>
> Set up:
>
> 
>
>
>
> 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and
> data) -> one OSD. 5 client m/c with 12 core cpu and each running two
> instances of ceph_smalliobench (10 clients total). Network is 10GbE.
>
>
>
> Workload:
>
> -
>
>
>
> Small workload – 20K objects with 4K size and io_size is also 4K RR.
> The intent is to serve the ios from memory so that it can uncover the
> performance problems within single OSD.
>
>
>
> Results from Firefly:
>
> --
>
>
>
> Single client throughput is ~14K iops, but as the number of client
> increases the aggregated throughput is not increasing. 10 clients ~15K
> iops. ~9-10 cpu cores are used.
>
>
>
> Result with latest master:
>
> --
>
>
>
> Single client is ~14K iops, but scaling as number of clients
> increases. 10 clients ~107K iops. ~25 cpu cores are used.
>
>
>
> --
> --
> --
>
>
>
>
>
> More realistic workload:
>
> -
>
> Let’s see how it is performing while > 90% of the ios are served from
> disks
>
> Setup:
>
> ---
>
> 40 cpu core server as a cluster node (single node cluster) with 64 GB
> RAM. 8 SSDs -> 8 OSDs. One similar node for monitor and rgw. Another
> node for client running fio/vdbench. 4 rbds are configured with
> ‘noshare’ option. 40 GbE network
>
>
>
> Workload:
>
> 
>
>
>
> 8 SSDs are populated , so, 8 * 800GB = ~6.4 TB of data.  Io_size = 4K RR.
>
>
>
> Results from Firefly:
>
> 
>
>
>
> Aggregated output while 4 rbd clients stressing the cluster in
> parallel is ~20-25K IOPS , cpu cores used ~8-10 cores (may be less
> can’t remember
> precisely)
>
>
>
> Results from latest master:
>
> 
>
>
>
> Aggregated output while 4 rbd clients stressing the cluster in
> parallel is ~120K IOPS , cpu is 7% idle i.e  ~37-38 cpu cores.
>
>
>
> Hope this helps.
>
>
>

Thanks Roy, the results are very promising!

Just two moments - are numbers from above related to the HT cores or you 
renormalized the result for real ones? And what about percentage of I/O 
time/utilization in this test was (if you measured this ones)?





Re: [ceph-users] Fwd: Ceph Filesystem - Production?

2014-08-29 Thread James Devine
I am running active/standby and it didn't swap over to the standby.  If I
shut down the active server it swaps to the standby fine though.  When there
were issues, disk access would back up on the webstats servers and a cat of
/sys/kernel/debug/ceph/*/mdsc would have a list of entries whereas normally
it would only list one or two if any.  I have 4 cores and 2GB of ram on the
mds machines.  Watching it right now it is using most of the ram and some
of swap although most of the active ram is disk cache.  I lowered the
memory.swappiness
value to see if that helps.  I'm also logging top output if it happens
again.
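
For reference, a sketch of those two checks from the command line (this assumes the kernel-wide vm.swappiness sysctl is the knob being lowered):

# count in-flight MDS requests per mount; more than a handful usually means the MDS is laggy/stuck
wc -l /sys/kernel/debug/ceph/*/mdsc

# make the kernel less eager to swap out the ceph-mds process (0-100, lower = less swapping)
sysctl -w vm.swappiness=10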


On Thu, Aug 28, 2014 at 8:22 PM, Yan, Zheng  wrote:

> On Fri, Aug 29, 2014 at 8:36 AM, James Devine  wrote:
> >
> > On Thu, Aug 28, 2014 at 1:30 PM, Gregory Farnum 
> wrote:
> >>
> >> On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman
> >>  wrote:
> >> > Is Ceph Filesystem ready for production servers?
> >> >
> >> > The documentation says it's not, but I don't see that mentioned
> anywhere
> >> > else.
> >> > http://ceph.com/docs/master/cephfs/
> >>
> >> Everybody has their own standards, but Red Hat isn't supporting it for
> >> general production use at this time. If you're brave you could test it
> >> under your workload for a while and see how it comes out; the known
> >> issues are very much workload-dependent (or just general concerns over
> >> polish).
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > I've been testing it with our webstats since it gets live hits but isn't
> > customer affecting.  Seems the MDS server has problems every few days
> > requiring me to umount and remount the ceph disk to resolve.  Not sure if
> > the issue is resolved in development versions but as of 0.80.5 we seem
> to be
> > hitting it.  I set the log verbosity to 20 so there's tons of logs but
> ends
> > with
>
> The cephfs client is supposed to be able to handle MDS takeover.
> what's symptom makes you umount and remount the cephfs ?
>
> >
> > 2014-08-24 07:10:19.682015 7f2b575e7700 10 mds.0.14  laggy, deferring
> > client_request(client.92141:6795587 getattr pAsLsXsFs #1026dc1)
> > 2014-08-24 07:10:19.682021 7f2b575e7700  5 mds.0.14 is_laggy 19.324963 >
> 15
> > since last acked beacon
> > 2014-08-24 07:10:20.358011 7f2b554e2700 10 mds.0.14 beacon_send up:active
> > seq 127220 (currently up:active)
> > 2014-08-24 07:10:21.515899 7f2b575e7700  5 mds.0.14 is_laggy 21.158841 >
> 15
> > since last acked beacon
> > 2014-08-24 07:10:21.515912 7f2b575e7700 10 mds.0.14  laggy, deferring
> > client_session(request_renewcaps seq 26766)
> > 2014-08-24 07:10:21.515915 7f2b575e7700  5 mds.0.14 is_laggy 21.158857 >
> 15
> > since last acked beacon
> > 2014-08-24 07:10:21.981148 7f2b575e7700 10 mds.0.snap check_osd_map
> > need_to_purge={}
> > 2014-08-24 07:10:21.981176 7f2b575e7700  5 mds.0.14 is_laggy 21.624117 >
> 15
> > since last acked beacon
> > 2014-08-24 07:10:23.170528 7f2b575e7700  5 mds.0.14 handle_mds_map epoch
> 93
> > from mon.0
> > 2014-08-24 07:10:23.175367 7f2b532d5700  0 -- 10.251.188.124:6800/985 >>
> > 10.251.188.118:0/2461578479 pipe(0x5588a80 sd=23 :6800 s=2 pgs=91 cs=1
> l=0
> > c=0x2cbfb20).fault with nothing to send, going to standby
> > 2014-08-24 07:10:23.175376 7f2b533d6700  0 -- 10.251.188.124:6800/985 >>
> > 10.251.188.55:0/306923677 pipe(0x5588d00 sd=22 :6800 s=2 pgs=7 cs=1 l=0
> > c=0x2cbf700).fault with nothing to send, going to standby
> > 2014-08-24 07:10:23.175380 7f2b531d4700  0 -- 10.251.188.124:6800/985 >>
> > 10.251.188.31:0/2854230502 pipe(0x5589480 sd=24 :6800 s=2 pgs=881 cs=1
> l=0
> > c=0x2cbfde0).fault with nothing to send, going to standby
> > 2014-08-24 07:10:23.175438 7f2b534d7700  0 -- 10.251.188.124:6800/985 >>
> > 10.251.188.68:0/2928927296 pipe(0x5588800 sd=21 :6800 s=2 pgs=7 cs=1 l=0
> > c=0x2cbf5a0).fault with nothing to send, going to standby
> > 2014-08-24 07:10:23.184201 7f2b575e7700 10 mds.0.14  my compat
> > compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data}
> > 2014-08-24 07:10:23.184255 7f2b575e7700 10 mds.0.14  mdsmap compat
> > compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap}
> > 2014-08-24 07:10:23.184264 7f2b575e7700 10 mds.-1.-1 map says i am
> > 10.251.188.124:6800/985 mds.-1.-1 state down:dne
> > 2014-08-24 07:10:23.184275 7f2b575e7700 10 mds.-1.-1  peer mds gid 94665
> > removed from map
> > 2014-08-24 07:10:23.184282 7f2b575e7700  1 mds.-1.-1 handle_mds_map i
> > (10.251.188.124:6800/985) dne in the mdsmap, re

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Somnath Roy
Sebastien,

No timeline yet..
There are a couple of issues here.

1. I don't think FileStore is a solution for SSDs because of its double write:
write amplification (WA) will always be >= 2.  So far we can't find a way to replace
the Ceph journal for the filesystem backend.

2. The existing key-value backends like leveldb/RocksDB are also not flash
optimized. Yes, the Ceph journal would not be needed anymore, but I guess the
compaction overhead on flash will be too much.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Sebastien Han
Sent: Friday, August 29, 2014 3:17 AM
To: ceph-users
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS

Thanks a lot for the answers, even if we drifted from the main subject a little 
bit.
Thanks Somnath for sharing this, when can we expect any codes that might 
improve _write_ performance?

@Mark thanks trying this :)
Unfortunately using nobarrier and another dedicated SSD for the journal  (plus 
your ceph setting) didn't bring much, now I can reach 3,5K IOPS.
By any chance, would it be possible for you to test with a single OSD SSD?

On 28 Aug 2014, at 18:11, Sebastien Han  wrote:

> Hey all,
>
> It has been a while since the last thread performance related on the
> ML :p I've been running some experiment to see how much I can get from an SSD 
> on a Ceph cluster.
> To achieve that I did something pretty simple:
>
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (i'd like to keep this realistic since in a real
> deployment i'll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * 1 replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop but deadline was showing the same
> results
> * no updatedb running
>
> About the box:
>
> * 32GB of RAM
> * 12 cores with HT @ 2,4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn't help here)
>
> The SSD is a 200G Intel DC S3700 and is capable of delivering around
> 29K iops with random 4k writes (my fio results) As a benchmark tool I used 
> fio with the rbd engine (thanks deutsche telekom guys!).
>
> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
>
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>
> # du -sh rand.file
> 256Mrand.file
>
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>
> See my ceph.conf:
>
> [global]
>  auth cluster required = cephx
>  auth service required = cephx
>  auth client required = cephx
>  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>  osd pool default pg num = 4096
>  osd pool default pgp num = 4096
>  osd pool default size = 2
>  osd crush chooseleaf type = 0
>
>   debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
>
> [mon]
>  mon osd down out interval = 600
>  mon osd min down reporters = 13
>[mon.ceph-01]
>host = ceph-01
>mon addr = 172.20.20.171
>  [mon.ceph-02]
>host = ceph-02
>mon addr = 172.20.20.172
>  [mon.ceph-03]
>host = ceph-03
>mon addr = 172.20.20.173
>
>debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>debug osd = 0/0
>debug optracker = 0/0
>debug objclass = 0/0
>debug filestore = 0/0
>debug journal = 0/0
>debug ms = 0/0
>debug monc = 0/0
>debug tp = 0/0
>debug auth = 0/0
>debug finisher = 0/0
>debug heartbeatmap = 0/0
>debug perfcounter = 0/0
>debug asok = 0/0
>debug throttle = 0/0
>
> [osd]
>  osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> osd journal size = 20480
> cluster_network = 172.20.20.0/24
> public_network = 172.20.20.0/24
> osd mon heartbeat interval = 30
> # Performance tuning
> filestore merge threshold = 40
> filestore split multiple = 8
> osd op threads = 8
> # Recovery tuning
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery op priority = 1
>
>
>debug lockdep = 0/0
>debug context = 0/0
>debug crush = 0/0
>debug buffer = 0/0
>debug timer = 0/0
>debug journaler = 0/0
>

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-08-29 Thread Stefan Priebe - Profihost AG
Hi Somnath,

we're in the process of evaluating SanDisk SSDs for Ceph (filesystem and journal on each).

8 OSDs / SSDs per host, Xeon E3 1650.

Which one can you recommend?

Greets,
Stefan

Excuse my typo sent from my mobile phone.

> On 29.08.2014 at 18:33, Somnath Roy  wrote:
> 
> Somnath


[ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole

2014-08-29 Thread Chad Seys
Hi All,
  Does anyone have a script or sequence of commands to prepare all drives on a 
single computer for use by ceph, and then start up all OSDs on the computer at 
one time?
  I feel this would be faster and generate less network traffic than adding one drive 
at a time, which is what the current script does.

Thanks!
Chad.


Re: [ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole

2014-08-29 Thread Konrad Gutkowski

Hi,

You could use ceph-deploy.

There won't be any difference in the total amount of data being moved.
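
For example, something along these lines prepares and starts several OSDs on one host in a single pass (a sketch only; the hostname and device names are placeholders, with journals colocated on each disk):

ceph-deploy disk zap node1:sdb node1:sdc node1:sdd
ceph-deploy osd create node1:sdb node1:sdc node1:sdd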

On 29.08.2014 at 18:53, Chad Seys  wrote:


Hi All,
  Does anyone have a script or sequence of commands to prepare all  
drives on a
single computer for use by ceph, and then start up all OSDs on the  
computer at

one time?
  I feel this would be faster and less network traffic than adding one  
drive

at a time, which is what the current script does.

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Konrad Gutkowski


Re: [ceph-users] 'incomplete' PGs: what does it mean?

2014-08-29 Thread Gregory Farnum
Hmm, so you've got PGs which are out-of-date on disk (by virtue of being an
older snapshot?) but still have records of them being newer in the OSD
journal?
That's a new failure mode for me and I don't think we have any tools
designed for solving it. If you can *back up the disk* before doing this, I
think I'd try flushing the OSD journal, then zapping it and creating a new
one.
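
For the journal part, a rough sketch (again, back up the data directory first; the OSD id is a placeholder, and the stop command depends on your init system):

service ceph stop osd.1
ceph-osd -i 1 --flush-journal   # write everything still in the journal out to the store
ceph-osd -i 1 --mkjournal       # create a fresh journal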
You'll probably still have an inconsistent store, so you'll then need to,
for each incomplete pg (and for every OSD at the same point on it), use the
OSD store tool to extract the "pginfo" entry and rewrite it to the version
which the pg actually has on disk.

These are uncharted waters and you're going to need to explore some dev
tools; good luck!
(Unless Sam or somebody has a better solution.)
-Greg

On Thursday, August 28, 2014, John Morris  wrote:

> Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE
> thread.  They were enough to get PGs 'incomplete' due to 'not enough
> OSDs hosting' resolved by rolling back to a btrfs snapshot.  I promise
> to write a full post-mortem (embarrassing as it will be) after the
> cluster is fully healthy.
>
> As is my kind of luck, the cluster *also* suffers from eight of the
> *other* 'missing log' sort of 'incomplete' PGs:
>
> 2014-08-28 23:42:03.350612 7f1cc9d82700 20 osd.1 pg_epoch: 13085
>   pg[2.5c( v 10143'300715 (9173'297714,10143'300715] local-les=11890
>   n=3404 ec=1 les/c 11890/10167 12809/12809/12809) [7,3,0,4]
>   r=-1 lpr=12809 pi=8453-12808/138 lcod 0'0
>   inactive NOTIFY]
>   handle_activate_map: Not dirtying info:
>   last_persisted is 13010 while current is 13085
>
> The data clearly exists in the renegade OSD's data directory.  There
> are no reported 'unfound' objects, so 'mark_unfound_lost revert' doesn't
> apply.  No apparently useful data from 'ceph pg ... query', but an
> example is pasted below.
>
> Since the beginning of the cluster rebuild, all ceph clients have been
> turned off, so I believe there's no fear of lost data by reverting to
> these PGs, and besides there're always backup tapes.
>
> How can Ceph be told to accept the versions it sees on osd.1 as the
> most current version, and forget the missing log history?
>
>John
>
>
> $ ceph pg 2.5c query
> { "state": "incomplete",
>   "epoch": 13241,
>   "up": [
> 7,
> 1,
> 0,
> 4],
>   "acting": [
> 7,
> 1,
> 0,
> 4],
>   "info": { "pgid": "2.5c",
>   "last_update": "10143'300715",
>   "last_complete": "10143'300715",
>   "log_tail": "9173'297715",
>   "last_backfill": "0\/\/0\/\/-1",
>   "purged_snaps": "[]",
>   "history": { "epoch_created": 1,
>   "last_epoch_started": 11890,
>   "last_epoch_clean": 10167,
>   "last_epoch_split": 0,
>   "same_up_since": 13229,
>   "same_interval_since": 13229,
>   "same_primary_since": 13118,
>   "last_scrub": "10029'298459",
>   "last_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "last_deep_scrub": "8323'284793",
>   "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
>   "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649"},
>   "stats": { "version": "10143'300715",
>   "reported_seq": "1764",
>   "reported_epoch": "13241",
>   "state": "incomplete",
>   "last_fresh": "2014-08-29 01:35:44.196909",
>   "last_change": "2014-08-29 01:22:50.298880",
>   "last_active": "0.00",
>   "last_clean": "0.00",
>   "last_became_active": "0.00",
>   "last_unstale": "2014-08-29 01:35:44.196909",
>   "mapping_epoch": 13223,
>   "log_start": "9173'297715",
>   "ondisk_log_start": "9173'297715",
>   "created": 1,
>   "last_epoch_clean": 10167,
>   "parent": "0.0",
>   "parent_split_bits": 0,
>   "last_scrub": "10029'298459",
>   "last_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "last_deep_scrub": "8323'284793",
>   "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
>   "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "log_size": 3000,
>   "ondisk_log_size": 3000,
>   "stats_invalid": "0",
>   "stat_sum": { "num_bytes": 0,
>   "num_objects": 0,
>   "num_object_clones": 0,
>   "num_object_copies": 0,
>   "num_objects_missing_on_primary": 0,
>   "num_objects_degraded": 0,
>   "num_objects_unfound": 0,
>   "num_read": 0,
>   "num_read_kb": 0,
>   "num_write": 0,
>   "num_write_kb": 0,
>   "num_scrub_errors": 0,
>   "num_shallow_scrub_errors": 0,
>   "num_deep_scrub_errors": 0,
>   "num_objects_recovered": 0,
>   "num_bytes_recovered": 0,
>   "num_keys_recovered": 0},
>   "stat

Re: [ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole

2014-08-29 Thread Olivier DELHOMME
Hello,

- Mail original -
> De: "Chad Seys" 
> À: ceph-users@lists.ceph.com
> Envoyé: Vendredi 29 Août 2014 18:53:19
> Objet: [ceph-users] script for commissioning a node with multiple osds,   
> added to cluster as a whole
> 
> Hi All,
>   Does anyone have a script or sequence of commands to prepare all drives on
>   a
> single computer for use by ceph, and then start up all OSDs on the computer
> at
> one time?
>   I feel this would be faster and less network traffic than adding one drive
> at a time, which is what the current script does.

You can use puppet :

https://github.com/cernceph/puppet-ceph
https://github.com/enovance/puppet-ceph

You may also have a look here :

https://github.com/cernceph

Cheers,

Olivier.


Re: [ceph-users] question about monitor and paxos relationship

2014-08-29 Thread Gregory Farnum
On Thu, Aug 28, 2014 at 9:52 PM, pragya jain  wrote:
> I have some basic questions about the monitor and Paxos relationship:
>
> As the documents say, a Ceph monitor contains the cluster map; if there is any
> change in the state of the cluster, the change is updated in the cluster
> map. Monitors use the Paxos algorithm to create consensus among the monitors and
> establish a quorum.
> And when we talk about the Paxos algorithm, the documents say that the monitor
> writes all changes to the Paxos instance and Paxos writes the changes to a
> key/value store for strong consistency.
>
> #1: I am unable to understand what the Paxos algorithm actually does. Are all
> changes in the cluster map made by the Paxos algorithm? How does it create
> consensus among the monitors?

Paxos is an algorithm for making decisions and/or safely replicating
data in a distributed system. The Ceph monitor cluster uses it for all
changes to any of its data.

> My assumption is: the cluster map is updated when OSDs report any change to the
> monitor, and there is no role for Paxos in it; Paxos only writes changes made by
> the monitors. Please, somebody, elaborate on this point.

Every change the monitors incorporate to any data structure, most
definitely including the OSD map's changes based on reports from OSDs,
is passed through paxos.

> #2: why is an odd number of monitors recommended for a production cluster, and not
> an even number?

This is because of how Paxos-based systems provide their durability and uptime
guarantees.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] question about monitor and paxos relationship

2014-08-29 Thread J David
On Fri, Aug 29, 2014 at 12:52 AM, pragya jain  wrote:
> #2: why is an odd number of monitors recommended for a production cluster, and not
> an even number?

Because to achieve a quorum, you must always have participation of
more than 50% of the monitors.  Not 50%.  More than 50%.  With an even
number of monitors, half is not a quorum so you need half + 1.  With
an odd number of monitors, there's no such thing as half.  So with an
even number of monitors, one is always "wasted."

1 monitors -> 1 to make quorum -> 0 can be lost
2 monitors -> 2 to make quorum -> 0 can be lost
3 monitors -> 2 to make quorum -> 1 can be lost
4 monitors -> 3 to make quorum -> 1 can be lost
5 monitors -> 3 to make quorum -> 2 can be lost
6 monitors -> 4 to make quorum -> 2 can be lost
7 monitors -> 4 to make quorum -> 3 can be lost

So an even number N of monitors doesn't give you any better fault
resilience than N-1 monitors.  And the more monitors you have, the
more traffic there is between them.  So when N is even, N monitors
consume more resources and provide no extra benefit compared to N-1
monitors.
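
The pattern in the list above is just integer arithmetic, e.g.:

# quorum = floor(N/2) + 1; tolerable failures = N - quorum
for N in 1 2 3 4 5 6 7; do
  Q=$(( N / 2 + 1 ))
  echo "$N monitors -> $Q to make quorum -> $(( N - Q )) can be lost"
done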

Hope that helps!


Re: [ceph-users] question about monitor and paxos relationship

2014-08-29 Thread Joao Eduardo Luis

On 08/29/2014 11:22 PM, J David wrote:

So an even number N of monitors doesn't give you any better fault
resilience than N-1 monitors.  And the more monitors you have, the
more traffic there is between them.  So when N is even, N monitors
consume more resources and provide no extra benefit compared to N-1
monitors.


Except for more copies ;)

But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5. 
 As long as you don't go with 1 you should be okay.  Only go with 1 if 
you're truly okay with losing whatever you're storing if that one 
monitor's disk is fried.


  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com