Re: [ceph-users] Cache Tiering Question

2015-10-20 Thread Nick Fisk
I think what also makes things seem a little disconnected is that 
target_max_bytes and the relative levels are set at the pool level, whereas the 
current eviction logic works at a per-OSD/PG level, so these values are 
translated into per-OSD estimates. This means that, depending on how the stale 
objects are distributed, you can end up in a situation where it looks like 
flushing/eviction is not strictly sticking to the configured values.
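For reference, these are the pool-level knobs in question. A minimal sketch,
using the ssd-pool cache pool that shows up in the ceph df output later in this
thread (the byte value is only an example):

ceph osd pool set ssd-pool target_max_bytes 1099511627776    # 1 TiB
ceph osd pool set ssd-pool cache_target_dirty_ratio 0.4      # start flushing at 40%
ceph osd pool set ssd-pool cache_target_full_ratio 0.8       # start evicting at 80%

The ratios are applied against target_max_bytes (and target_max_objects, if
set), not against the size of the underlying pool.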

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 16 October 2015 00:50
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cache Tiering Question
> 
> 
> Hello,
> 
> Having run into this myself two days ago (setting relative sizing values 
> doesn't
> flush things when expected) I'd say that the documentation is highly
> misleading when it comes to the relative settings.
> 
> And unclear when it comes to the size/object settings.
> 
> Guess this section needs at least one nice red paragraph and some further
> explanations.
> 
> Christian
> 
> On Thu, 15 Oct 2015 17:33:30 -0600 Robert LeBlanc wrote:
> 
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > One more question. Is max_{bytes,objects} before or after replication
> > factor?
> > - 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Thu, Oct 15, 2015 at 4:42 PM, LOPEZ Jean-Charles  wrote:
> > > Hi Robert,
> > >
> > > yes they do.
> > >
> > > Pools don't have a size when you create them, hence the pair of
> > > value/ratio settings that has to be defined for the cache tiering
> > > mechanism. Pools only have a number of PGs assigned. So the max values
> > > and the ratios for dirty and full must be set explicitly to match your
> > > configuration.
> > >
> > > Note that you can define max_bytes and max_objects at the same time.
> > > Whichever of the two values breaches its threshold first, per your ratio
> > > settings, will trigger eviction and/or flushing. The ratios you choose
> > > apply to both values.
> > >
> > > Cheers
> > > JC
> > >
> > >> On 15 Oct 2015, at 15:02, Robert LeBlanc  wrote:
> > >>
> > >> -BEGIN PGP SIGNED MESSAGE-
> > >> Hash: SHA256
> > >>
> > >> hmmm...
> > >>
> > >> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#relative-sizing
> > >>
> > >> makes it sound like it should be based on the size of the pool and
> > >> that you don't have to set anything like max bytes/objects. Can you
> > >> confirm that cache_target_{dirty,dirty_high,full}_ratio works as a
> > >> ratio of target_max_bytes set?
> > >> - 
> > >> Robert LeBlanc
> > >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >>
> > >>
> > >> On Thu, Oct 15, 2015 at 3:32 PM, Nick Fisk  wrote:
> > >>>
> > >>>
> > >>>
> > >>>
> >  -Original Message-
> >  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >  Behalf Of Robert LeBlanc
> >  Sent: 15 October 2015 22:06
> >  To: ceph-users@lists.ceph.com
> >  Subject: [ceph-users] Cache Tiering Question
> > 
> >  -BEGIN PGP SIGNED MESSAGE-
> >  Hash: SHA256
> > 
> >  ceph df (ceph version 0.94.3-252-g629b631
> >  (629b631488f044150422371ac77dfc005f3de1bc)) is showing some odd
> >  results:
> > 
> >  root@nodez:~# ceph df
> >  GLOBAL:
> > SIZE   AVAIL  RAW USED %RAW USED
> 24518G 21670G    1602G      6.53
> >  POOLS:
> > NAME ID USED  %USED MAX AVAIL OBJECTS
> > rbd  0  2723G 11.11 6380G 1115793
> > ssd-pool 2  0 0  732G   1
> > 
>  The rbd pool is showing 11.11% used, but if you calculate the numbers
>  there it is 2723/6380=42.68%.
> > >>>
> > >>> I have a feeling that the percentage is based on the amount used
> > >>> of the total cluster size, i.e. 2723/24518 ≈ 11.1%.
> > >>>
> > 
>  Will this cause problems with the relative cache tier settings? Do I need
>  to set the percentage based on what Ceph is reporting here?
> > >>>
> > >>> The flushing/eviction thresholds are based on the target_max_bytes
> > >>> number that you set, they have nothing to do with the underlying
> > >>> pool size. It's up to you to come up with a sane number for this
> > >>> variable.
> > >>>
> > 
> >  Thanks,
> >  - 
> >  Robert LeBlanc
> >  PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
> >  B9F1

[ceph-users] planet.ceph.com

2015-10-20 Thread Luis Periquito
Hi,

I was looking for some ceph resources and saw a reference to
planet.ceph.com. However when I opened it I was sent to a dental
clinic (?). That doesn't sound right, does it?

I was at this page when I saw the reference...

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-20 Thread hzwuli...@gmail.com
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine.
Here is my test situation:
1. ssd pool  (9 OSD servers with 2 osds on each server, 10Gb networks for 
public & cluster networks)
2. volume1: use rbd to create a 100G volume from the ssd pool and map it to the 
real machine
3. volume2: use cinder to create a 100G volume from the ssd pool and attach it 
to a guest host
4. disable rbd cache
5. fio test on the two volumes:
[global]
rw=randwrite
bs=4k
ioengine=libaio
iodepth=64
direct=1
size=64g
runtime=300s
group_reporting=1
thread=1
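
For reference, a rough sketch of how steps 2 and 3 would typically be done
(the exact image and volume names are assumptions):

rbd create volume1 --pool ssd-pool --size 102400    # 100G volume, step 2
rbd map volume1 --pool ssd-pool                     # krbd on the real machine
cinder create --display-name volume2 100            # attached via librbd, step 3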

volume1 got about 24k IOPS and volume2 got about 14k IOPS.

We can see that the performance of volume2 is not good compared to volume1, so 
is this normal behavior for a guest host?
If not, what might the problem be?

Thanks!


hzwuli...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
On Tue, Oct 20, 2015 at 3:26 AM, Haomai Wang  wrote:
> The fact is that journal could help a lot for rbd use cases,
> especially for small ios. I don' t think it will be bottleneck. If we
> just want to reduce double write, it doesn't solve any performance
> problem.
>

One trick I've been using in my ceph clusters is hiding a slow write
backend behind a fast journal device. The write performance will be that of
the fast (and small) journal device. This only helps on writes, but it
can make a huge difference.

I've even made some tests showing (to within 10%, with RBD and S3) that the
backend device doesn't matter and the write performance is essentially the
same as that of the journal device fronting all the writes.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Jan Schermer

> On 20 Oct 2015, at 10:34, Luis Periquito  wrote:
> 
> On Tue, Oct 20, 2015 at 3:26 AM, Haomai Wang  wrote:
>> The fact is that journal could help a lot for rbd use cases,
>> especially for small ios. I don' t think it will be bottleneck. If we
>> just want to reduce double write, it doesn't solve any performance
>> problem.
>> 

Yes, in theory the journal sounds like an excellent idea, but one would expect 
performance somewhere between an SSD journal device and an HDD filestore when 
trying that use case. In practice, you get close to plain HDD performance 
for small writes, like the ones databases do.
The complexity just needs to drop if it has to compete with anything...

> 
> One trick I've been using in my ceph clusters is hiding a slow write
> backend behind a fast journal device. The write performance will be of
> the fast (and small) journal device. This only helps on write, but it
> can make a huge difference.
> 

Do you mean an external filesystem journal? What filesystem? ext4/xfs?
I tried that on a physical machine and it worked wonders with both of them, 
even though data wasn't journaled and hit the platters - I don't yet understand 
how that was possible but the benchmark just flew.

Jan

> I've even made some tests showing (within 10%, RBD and S3) that the
> backend device doesn't matter and the write performance is exactly the
> same as that of the journal device fronting all the writes.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
>> One trick I've been using in my ceph clusters is hiding a slow write
>> backend behind a fast journal device. The write performance will be of
>> the fast (and small) journal device. This only helps on write, but it
>> can make a huge difference.
>>
>
> Do you mean an external filesystem journal? What filesystem? ext4/xfs?
> I tried that on a physical machine and it worked wonders with both of them, 
> even though data wasn't journaled and hit the platters - I don't yet 
> understand how that was possible but the benchmark just flew.
>

I just have a raw partition on the journal device (SSD) and point "osd
journal" to that block device (something like "osd journal =
/dev/vgsde/journal-8"). So there is no filesystem on the journal device. The
osd data is then on a local HDD using a normal XFS filesystem.

To help this I usually have big amounts of RAM (on average 6G per
OSD), so the buffered writes to the spindle can take their time to
flush.
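
For reference, a minimal ceph.conf sketch of that layout (the OSD id and data
path are assumptions; the journal path is the one quoted above):

[osd.8]
osd journal = /dev/vgsde/journal-8       # raw LV on the SSD, no filesystem
osd data = /var/lib/ceph/osd/ceph-8      # XFS on the local HDD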
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Jan Schermer

> On 20 Oct 2015, at 11:28, Luis Periquito  wrote:
> 
>>> One trick I've been using in my ceph clusters is hiding a slow write
>>> backend behind a fast journal device. The write performance will be of
>>> the fast (and small) journal device. This only helps on write, but it
>>> can make a huge difference.
>>> 
>> 
>> Do you mean an external filesystem journal? What filesystem? ext4/xfs?
>> I tried that on a physical machine and it worked wonders with both of them, 
>> even though data wasn't journaled and hit the platters - I don't yet 
>> understand how that was possible but the benchmark just flew.
>> 
> 
> I just have a raw partition in the journal device (SSD) and point "osd
> journal" to that block device (something like "osd journal =
> /dev/vgsde/journal-8"). So no filesystem in the journal device. Then
> the osd data is in a local HDD using normal XFS filesystem.
> 
> To help this I do have usually big amounts of RAM (average 6G per
> OSD), so the buffered writes to the spindle can take their time to
> flush.

Oh. I think that's a pretty normal scenario actually.

Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-hammer and debian jessie - missing files on repository

2015-10-20 Thread Björn Lässig

Hi,

Thanks guys for supporting the latest debian stable release with latest 
ceph stable!


As version 0.94.4 has been released, I tried to upgrade my 
debian/jessie cluster with hammer/wheezy packages to hammer/jessie.


Unfortunately the download.ceph.com/debian-hammer debian repository is 
in some strange state (even if you ignore that the IPv6 connection is 
very flaky from different sites in Europe).


eg:

in
http://download.ceph.com/debian-hammer/dists/jessie/main/binary-amd64/Packages
a package called ceph-common is referenced

- /
Package: ceph-common
Version: 0.94.4-1~bpo80+1
Architecture: amd64
Filename: pool/main/c/ceph/ceph-common_0.94.4-1~bpo80+1_amd64.deb
--\

but this package file does not exist. All files with version ~bpo80 do 
not exist (yet?).
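
For example, fetching the file referenced by the Packages stanza above fails
(the URL is built from the Filename field):

wget http://download.ceph.com/debian-hammer/pool/main/c/ceph/ceph-common_0.94.4-1~bpo80+1_amd64.deb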



could you check this please?

Thanks in advance
Björn Lässig

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Z Zhang
Hi Guys,

I am trying the latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with rocksdb 3.11 
as OSD backends. I use rbd to test performance and the following is my cluster info.

[ceph@xxx ~]$ ceph -s
    cluster b74f3944-d77f-4401-a531-fa5282995808
     health HEALTH_OK
     monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
            election epoch 1, quorum 0 xxx
     osdmap e338: 44 osds: 44 up, 44 in
            flags sortbitwise
      pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
            1940 MB used, 81930 GB / 81932 GB avail
                2048 active+clean

All the disks are spinning ones with write cache turned on. Rocksdb's WAL and 
sst files are on the same disk as every OSD.

Using fio to generate following write load: 
fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting 
-directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  

Test result:
WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
WAL enabled + sync: true (default) + disk write cache: on|off  will get only 
~25 IOPS.

I tuned some other rocksdb options, but with no luck. I tracked down the 
rocksdb code and found each writer's Sync operation would take ~30ms to finish. 
And as shown above, it is strange that performance shows little difference no 
matter whether the disk write cache is on or off.
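
As a sanity check of the raw sync-write latency outside rocksdb, fio with an
fsync after every write on the same disk should show comparable numbers (the
directory is an assumption); ~25 IOPS corresponds to ~40 ms per synced write,
in line with the ~30 ms Sync observed:

fio -name=walsim -directory=/var/lib/ceph/osd/ceph-0 -rw=write -bs=4K 
-size=10M -ioengine=sync -fsync=1 -direct=1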

Have you guys encountered a similar issue? Or did I miss something that causes 
rocksdb's poor write performance?

Thanks.
Zhi Zhang (David) 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-20 Thread Alexandre DERUMIER
Hi,

I'm able to reach around the same performance with qemu-librbd vs qemu-krbd
when I compile qemu with jemalloc
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363).

In my tests, librbd with jemalloc still uses 2x more cpu than krbd,
so cpu could be a bottleneck too.

With fast cpus (3.1GHz), I'm able to reach around 70k iops (4k) with an rbd 
volume, both with krbd and librbd.
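
For anyone wanting to reproduce, a rough sketch of such a build (the
--enable-jemalloc flag comes from the commit linked above; the target list is
an assumption):

git clone git://git.qemu.org/qemu.git && cd qemu
./configure --enable-jemalloc --target-list=x86_64-softmmu
make -j$(nproc)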


- Original Message -
From: hzwuli...@gmail.com
To: "ceph-users" 
Sent: Tuesday, 20 October 2015 10:22:33
Subject: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd to create a 100G volume from the ssd pool and map it to the 
real machine 
3. volume2: use cinder to create a 100G volume from the ssd pool and attach it 
to a guest host 
4. disable rbd cache 
5. fio test on the two volumes: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 

volume1 got about 24k IOPS and volume2 got about 14k IOPS. 

We can see that the performance of volume2 is not good compared to volume1, so 
is this normal behavior for a guest host? 
If not, what might the problem be? 

Thanks! 

hzwuli...@gmail.com 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Z Zhang wrote:
> Hi Guys,
> 
> I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with 
> rocksdb 3.11 as OSD backend. I use rbd to test performance and following 
> is my cluster info.
> 
> [ceph@xxx ~]$ ceph -s
>     cluster b74f3944-d77f-4401-a531-fa5282995808
>      health HEALTH_OK
>      monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
>             election epoch 1, quorum 0 xxx
>      osdmap e338: 44 osds: 44 up, 44 in
>             flags sortbitwise
>       pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
>             1940 MB used, 81930 GB / 81932 GB avail
>                 2048 active+clean
> 
> All the disks are spinning ones with write cache turned on. Rocksdb's 
> WAL and sst files are on the same disk as every OSD.

Are you using the KeyValueStore backend?

> Using fio to generate following write load: 
> fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting 
> -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  
> 
> Test result:
> WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
> WAL enabled + sync: true (default) + disk write cache: on|off  will get only 
> ~25 IOPS.
> 
> I tuned some other rocksdb options, but with no luck.

The wip-newstore-frags branch sets some defaults for rocksdb that I think 
look pretty reasonable (at least given how newstore is using rocksdb).

> I tracked down the rocksdb code and found each writer's Sync operation 
> would take ~30ms to finish. And as shown above, it is strange that 
> performance shows little difference no matter whether the disk write cache is on or 
> off.
> 
> Have you guys encountered a similar issue? Or did I miss something to 
> cause rocksdb's poor write performance?

Yes, I saw the same thing.  This PR addresses the problem and is nearing 
merge upstream:

https://github.com/facebook/rocksdb/pull/746

There is also an XFS performance bug that is contributing to the problem, 
but it looks like Dave Chinner just put together a fix for that.

But... we likely won't be using KeyValueStore in its current form over 
rocksdb (or any other kv backend).  It stripes object data over key/value 
pairs, which IMO is not the best approach.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread German Anders
Trying to upgrade from hammer 0.94.3 to 0.94.4, I'm getting the following
error msg while trying to restart the mon daemons:

2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
Operation not permitted


any ideas?

$ ceph -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)


Thanks in advance,

Cheers,

*German* 

2015-10-19 18:07 GMT-03:00 Sage Weil :

> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to
> Infernalis. That is, all users of earlier version of Hammer or any
> version of Firefly will first need to upgrade to hammer v0.94.4 or
> later before upgrading to Infernalis (or future releases).
>
> All v0.94.x Hammer users are strongly encouraged to upgrade.
>
> Changes
> ---
>
> * build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166,
> Nathan Cutler)
> * build/ops: ceph.spec.in: ceph-common needs python-argparse on older
> distros, but doesn't require it (#12034, Nathan Cutler)
> * build/ops: ceph.spec.in: radosgw requires apache for SUSE only -- makes
> no sense (#12358, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
> (#11991, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992,
> Owen Synge)
> * build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan
> Cutler)
> * build/ops: ceph.spec.in: snappy-devel for all supported distros
> (#12361, Nathan Cutler)
> * build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
> (#11629, Nathan Cutler)
> * build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build
> (#12351, Nathan Cutler)
> * build/ops: error in ext_mime_map_init() when /etc/mime.types is missing
> (#11864, Ken Dreyer)
> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
> (#11798, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#10927, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#11140, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#11686, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#12407, Sage Weil)
> * cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu Chai)
> * cli: ceph tell: broken error message / misleading hinting (#11101, Kefu
> Chai)
> * common: arm: all programs that link to librados2 hang forever on startup
> (#12505, Boris Ranto)
> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects
> (#13070, Sage Weil)
> * common: do not insert emtpy ptr when rebuild emtpy bufferlist (#12775,
> Xinze Chi)
> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
> * common: Memory leak in Mutex.cc, pthread_mutexattr_init without
> pthread_mutexattr_destroy (#11762, Ketor Meng)
> * common: object_map_update fails with -EINVAL return code (#12611, Jason
> Dillaman)
> * common: Pipe: Drop connect_seq increase line (#13093, Haomai Wang)
> * common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
> * crush: ceph osd crush reweight-subtree does not reweight parent node
> (#11855, Sage Weil)
> * doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
> * fs: ceph-fuse 0.94.2-1trusty segfaults / aborts (#12297, Greg Farnum)
> * fs: segfault launching ceph-fuse with bad --name (#12417, John Spray)
> * librados: Change radosgw pools default crush ruleset (#11640, Yuan Zhou)
> * librbd: correct issues discovered via lockdep / helgrind (#12345, Jason
> Dillaman)
> * librbd: Crash during TestInternal.MultipleResize (#12664, Jason Dillaman)
> * librbd: deadlock during cooperative exclusive lock transition (#11537,
> Jason Dillaman)
> * librbd: Possible crash while concurrently writing and shrinking an image
> (#11743, Jason Dillaman)
> * mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
> * mon: fix crush testing for new pools (#13400, Sage Weil)
> * mon: get pools health'info have error (#12402, renhwztetecs)
> * mon: implicit erasure code crush 

Re: [ceph-users] How ceph client abort IO

2015-10-20 Thread Jason Dillaman
There is no such interface currently on the librados / OSD side to abort IO 
operations.  Can you provide some background on your use-case for aborting 
in-flight IOs?

-- 

Jason Dillaman 


- Original Message - 

> From: "min fang" 
> To: ceph-users@lists.ceph.com
> Sent: Monday, October 19, 2015 6:41:40 PM
> Subject: [ceph-users] How ceph client abort IO

> Can the librbd interface provide an abort API for aborting IO? If yes, can the
> abort interface detach the write buffer immediately? I hope to reuse the write
> buffer quickly after issuing the abort request, rather than waiting for the IO
> to be aborted on the osd side.

> thanks.

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How ceph client abort IO

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Jason Dillaman wrote:
> There is no such interface currently on the librados / OSD side to abort 
> IO operations.  Can you provide some background on your use-case for 
> aborting in-flight IOs?

The internal Objecter has a cancel interface, but it can't yank back 
buffers, and it's not exposed to librados.

But... if you're using librados or librbd then I think we're making a copy 
of the buffer anyway so you can reuse it as soon as the IO is submitted.  
Unless you're using the C++ librados API and submitting a bufferlist?

sage


> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message - 
> 
> > From: "min fang" 
> > To: ceph-users@lists.ceph.com
> > Sent: Monday, October 19, 2015 6:41:40 PM
> > Subject: [ceph-users] How ceph client abort IO
> 
> > Can the librbd interface provide an abort API for aborting IO? If yes, can
> > the abort interface detach the write buffer immediately? I hope to reuse the
> > write buffer quickly after issuing the abort request, rather than waiting
> > for the IO to be aborted on the osd side.
> 
> > thanks.
> 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd export hangs / does nothing without regular drop_cache

2015-10-20 Thread Jason Dillaman
Can you provide more details on your setup and how you are running the rbd 
export?  If clearing the pagecache, dentries, and inodes solves the issue, it 
sounds like it's outside of Ceph (unless you are exporting to a CephFS or krbd 
mount point).

-- 

Jason Dillaman 


- Original Message - 

> From: "Stefan Priebe - Profihost AG" 
> To: ceph-us...@ceph.com
> Sent: Saturday, October 17, 2015 2:02:37 PM
> Subject: [ceph-users] rbd export hangs / does nothing without regular
> drop_cache

> Hi,

> I have a machine doing rbd backups with the rbd export command. Around 10tb each
> day. After some days it starts to get very slow and the rbd commands seem
> to do nothing. If I run echo 3 > /proc/sys/vm/drop_caches they immediately start
> generating traffic and dumping data again.

> What's wrong here?

> Kernel 4.1.10
> ceph firefly

> Greets,
> Stefan

> Excuse my typos. Sent from my mobile phone.

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] too many kworker processes after upgrade to 0.94.3

2015-10-20 Thread Andrei Mikhailovsky
Hello 

I've recently upgraded my ceph cluster from 0.94.1 to 0.94.3 and noticed that 
after about a day I started getting emails from our network/host monitoring 
system. The notifications were that there are too many processes on the osd 
servers. I've not seen this before and I have been running ceph for a good part 
of three years now. Taking a look at the osd servers revealed a large number of 
kworker processes. The number fluctuates between about 200 and 700, but 
generally stays around the 400+ mark. 

I've restarted the osd processes, which has dropped the number of kworker 
processes to about 60, but after about 20-30 minutes they were back up in the 
600 region. 

Has anyone noticed similar behavior? How can I determine what's causing such 
a high number of kworkers? 
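
For reference, one way to count them and to sample which code is queueing the
work (the workqueue tracepoint assumes a reasonably recent kernel):

ps -e | grep -c kworker
perf record -a -g -e workqueue:workqueue_queue_work sleep 10
perf report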

There was no other change to the OS apart from the ceph upgrade. The kernel and 
system libraries remained the same. 

Thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Haomai Wang
On Tue, Oct 20, 2015 at 8:47 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, Z Zhang wrote:
>> Hi Guys,
>>
>> I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with
>> rocksdb 3.11 as OSD backend. I use rbd to test performance and following
>> is my cluster info.
>>
>> [ceph@xxx ~]$ ceph -s
>> cluster b74f3944-d77f-4401-a531-fa5282995808
>>  health HEALTH_OK
>>  monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
>> election epoch 1, quorum 0 xxx
>>  osdmap e338: 44 osds: 44 up, 44 in
>> flags sortbitwise
>>   pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
>> 1940 MB used, 81930 GB / 81932 GB avail
>> 2048 active+clean
>>
>> All the disks are spinning ones with write cache turned on. Rocksdb's
>> WAL and sst files are on the same disk as every OSD.
>
> Are you using the KeyValueStore backend?
>
>> Using fio to generate following write load:
>> fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting 
>> -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1
>>
>> Test result:
>> WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
>> WAL enabled + sync: true (default) + disk write cache: on|off  will get only 
>> ~25 IOPS.
>>
>> I tuned some other rocksdb options, but with no luck.
>
> The wip-newstore-frags branch sets some defaults for rocksdb that I think
> look pretty reasonable (at least given how newstore is using rocksdb).
>
>> I tracked down the rocksdb code and found each writer's Sync operation
>> would take ~30ms to finish. And as shown above, it is strange that
>> performance shows little difference no matter whether the disk write cache is on or
>> off.
>>
>> Have you guys encountered a similar issue? Or did I miss something to
>> cause rocksdb's poor write performance?
>
> Yes, I saw the same thing.  This PR addresses the problem and is nearing
> merge upstream:
>
> https://github.com/facebook/rocksdb/pull/746
>

Cool, it looks like a reasonable explanation for the performance degradation.

> There is also an XFS performance bug that is contributing to the problem,

Are you referring to
this (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/27645)?

I think newstore also hits this situation.

> but it looks like Dave Chinner just put together a fix for that.
>
> But... we likely won't be using KeyValueStore in its current form over
> rocksdb (or any other kv backend).  It stripes object data over key/value
> pairs, which IMO is not the best approach.
>
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Z Zhang
Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer 
look.

Yes, I am trying the KVStore backend. The reason we are trying it is that a few 
users don't have such a high requirement on occasional data loss. It seems the 
KVStore backend without a synchronized WAL can achieve better performance than 
filestore. And only data still in the page cache would get lost on a machine 
crash, not a process crash, if we use the WAL but no synchronization. What do 
you think?


Thanks.
Zhi Zhang (David)

Date: Tue, 20 Oct 2015 05:47:44 -0700
From: s...@newdream.net
To: zhangz.da...@outlook.com
CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore

On Tue, 20 Oct 2015, Z Zhang wrote:
> Hi Guys,
> 
> I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with 
> rocksdb 3.11 as OSD backend. I use rbd to test performance and following 
> is my cluster info.
> 
> [ceph@xxx ~]$ ceph -s
> cluster b74f3944-d77f-4401-a531-fa5282995808
>  health HEALTH_OK
>  monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
> election epoch 1, quorum 0 xxx
>  osdmap e338: 44 osds: 44 up, 44 in
> flags sortbitwise
>   pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
> 1940 MB used, 81930 GB / 81932 GB avail
> 2048 active+clean
> 
> All the disks are spinning ones with write cache turned on. Rocksdb's 
> WAL and sst files are on the same disk as every OSD.
 
Are you using the KeyValueStore backend?
 
> Using fio to generate following write load: 
> fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting 
> -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  
> 
> Test result:
> WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
> WAL enabled + sync: true (default) + disk write cache: on|off  will get only 
> ~25 IOPS.
> 
> I tuned some other rocksdb options, but with no luck.
 
The wip-newstore-frags branch sets some defaults for rocksdb that I think 
look pretty reasonable (at least given how newstore is using rocksdb).
 
> I tracked down the rocksdb code and found each writer's Sync operation 
> would take ~30ms to finish. And as shown above, it is strange that 
> performance shows little difference no matter whether the disk write cache is on or 
> off.
> 
> Have you guys encountered a similar issue? Or did I miss something to 
> cause rocksdb's poor write performance?
 
Yes, I saw the same thing.  This PR addresses the problem and is nearing 
merge upstream:
 
https://github.com/facebook/rocksdb/pull/746
 
There is also an XFS performance bug that is contributing to the problem, 
but it looks like Dave Chinner just put together a fix for that.
 
But... we likely won't be using KeyValueStore in its current form over 
rocksdb (or any other kv backend).  It stripes object data over key/value 
pairs, which IMO is not the best approach.
 
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Haomai Wang
Actually keyvaluestore submits transactions with the sync flag
too (relying on the keyvaluedb implementation's journal/logfile).

Yes, if we disable the sync flag, keyvaluestore's performance will
increase a lot. But we don't provide this option now.

On Tue, Oct 20, 2015 at 9:22 PM, Z Zhang  wrote:
> Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer
> look.
>
> Yes, I am trying the KVStore backend. The reason we are trying it is that a
> few users don't have such a high requirement on occasional data loss. It
> seems the KVStore backend without a synchronized WAL can achieve better
> performance than filestore. And only data still in the page cache would get
> lost on a machine crash, not a process crash, if we use the WAL but no
> synchronization. What do you think?
>
> Thanks.
> Zhi Zhang (David)
>
> [quoted exchange with Sage Weil snipped; see earlier in this thread]
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Z Zhang wrote:
> Thanks, Sage, for pointing out the PR and ceph branch. I will take a 
> closer look.
> 
> Yes, I am trying the KVStore backend. The reason we are trying it is that 
> a few users don't have such a high requirement on occasional data loss. 
> It seems the KVStore backend without a synchronized WAL can achieve better 
> performance than filestore. And only data still in the page cache would get 
> lost on a machine crash, not a process crash, if we use the WAL but no 
> synchronization. What do you think?

That sounds dangerous.  The OSDs are recording internal metadata about the 
cluster (peering, replication, etc.)... even if you don't care so much 
about recent user data writes you probably don't want to risk breaking 
RADOS itself.  If the kv backend is giving you a stale point-in-time 
consistent copy it's not so bad, but in a power-loss event it could give 
you problems...

sage

> 
> Thanks. Zhi Zhang (David)
> 
> Date: Tue, 20 Oct 2015 05:47:44 -0700
> From: s...@newdream.net
> To: zhangz.da...@outlook.com
> CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> 
> On Tue, 20 Oct 2015, Z Zhang wrote:
> > Hi Guys,
> > 
> > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with 
> > rocksdb 3.11 as OSD backend. I use rbd to test performance and following 
> > is my cluster info.
> > 
> > [ceph@xxx ~]$ ceph -s
> > cluster b74f3944-d77f-4401-a531-fa5282995808
> >  health HEALTH_OK
> >  monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
> > election epoch 1, quorum 0 xxx
> >  osdmap e338: 44 osds: 44 up, 44 in
> > flags sortbitwise
> >   pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
> > 1940 MB used, 81930 GB / 81932 GB avail
> > 2048 active+clean
> > 
> > All the disks are spinning ones with write cache turned on. Rocksdb's 
> > WAL and sst files are on the same disk as every OSD.
>  
> Are you using the KeyValueStore backend?
>  
> > Using fio to generate following write load: 
> > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K 
> > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  
> > 
> > Test result:
> > WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
> > WAL enabled + sync: true (default) + disk write cache: on|off  will get 
> > only ~25 IOPS.
> > 
> > I tuned some other rocksdb options, but with no luck.
>  
> The wip-newstore-frags branch sets some defaults for rocksdb that I think 
> look pretty reasonable (at least given how newstore is using rocksdb).
>  
> > I tracked down the rocksdb code and found each writer's Sync operation 
> > would take ~30ms to finish. And as shown above, it is strange that 
> > performance shows little difference no matter whether the disk write cache is on or 
> > off.
> > 
> > Have you guys encountered a similar issue? Or did I miss something to 
> > cause rocksdb's poor write performance?
>  
> Yes, I saw the same thing.  This PR addresses the problem and is nearing 
> merge upstream:
>  
>   https://github.com/facebook/rocksdb/pull/746
>  
> There is also an XFS performance bug that is contributing to the problem, 
> but it looks like Dave Chinner just put together a fix for that.
>  
> But... we likely won't be using KeyValueStore in its current form over 
> rocksdb (or any other kv backend).  It stripes object data over key/value 
> pairs, which IMO is not the best approach.
>  
> sage
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>   
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Robert LeBlanc
Given enough load, that fast journal will get filled and you will only be as
fast as the backing disk can flush (while at the same time servicing reads). That's
the situation we are in right now. We are still seeing better
performance than a raw spindle, but only 150 IOPS, not the 15000 IOPS that the
SSD can do. You are still ultimately bound by the back end disk.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Oct 20, 2015 2:34 AM, "Luis Periquito"  wrote:

> On Tue, Oct 20, 2015 at 3:26 AM, Haomai Wang  wrote:
> > The fact is that journal could help a lot for rbd use cases,
> > especially for small ios. I don' t think it will be bottleneck. If we
> > just want to reduce double write, it doesn't solve any performance
> > problem.
> >
>
> One trick I've been using in my ceph clusters is hiding a slow write
> backend behind a fast journal device. The write performance will be of
> the fast (and small) journal device. This only helps on write, but it
> can make a huge difference.
>
> I've even made some tests showing (within 10%, RBD and S3) that the
> backend device doesn't matter and the write performance is exactly the
> same as that of the journal device fronting all the writes.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Mark Nelson
The hope is that with some of Somnath's work and perhaps additional 
future work, we might be able to make the journal a little smarter about 
how much data to keep and when to flush.  While we are still ultimately 
bound by the backend disk performance, we might be able to absorb writes 
in a smarter way and let more coalescing happen for longer periods of 
time than we currently do today.


On 10/20/2015 08:41 AM, Robert LeBlanc wrote:

Given enough load, that fast journal will get filled and you will only be
as fast as the backing disk can flush (while at the same time servicing reads).
That's the situation we are in right now. We are still seeing better
performance than a raw spindle, but only 150 IOPS, not the 15000 IOPS that
the SSD can do. You are still ultimately bound by the back end disk.

Robert LeBlanc

Sent from a mobile device please excuse any typos.

On Oct 20, 2015 2:34 AM, "Luis Periquito"  wrote:

On Tue, Oct 20, 2015 at 3:26 AM, Haomai Wang  wrote:
 > The fact is that journal could help a lot for rbd use cases,
 > especially for small ios. I don' t think it will be bottleneck. If we
 > just want to reduce double write, it doesn't solve any performance
 > problem.
 >

One trick I've been using in my ceph clusters is hiding a slow write
backend behind a fast journal device. The write performance will be of
the fast (and small) journal device. This only helps on write, but it
can make a huge difference.

I've even made some tests showing (within 10%, RBD and S3) that the
backend device doesn't matter and the write performance is exactly the
same as that of the journal device fronting all the writes.
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Z Zhang
Haomai, you're right. I added such a sync option as configurable for our test 
purposes.


Thanks.
Zhi Zhang (David)

> Date: Tue, 20 Oct 2015 21:24:49 +0800
> From: haomaiw...@gmail.com
> To: zhangz.da...@outlook.com
> CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> 
> Actually keyvaluestore submits transactions with the sync flag
> too (relying on the keyvaluedb implementation's journal/logfile).
> 
> Yes, if we disable the sync flag, keyvaluestore's performance will
> increase a lot. But we don't provide this option now
> 
> On Tue, Oct 20, 2015 at 9:22 PM, Z Zhang  wrote:
> > Thanks, Sage, for pointing out the PR and ceph branch. I will take a closer
> > look.
> >
> > Yes, I am trying the KVStore backend. The reason we are trying it is that a
> > few users don't have such a high requirement on occasional data loss. It
> > seems the KVStore backend without a synchronized WAL can achieve better
> > performance than filestore. And only data still in the page cache would get
> > lost on a machine crash, not a process crash, if we use the WAL but no
> > synchronization. What do you think?
> >
> > Thanks.
> > Zhi Zhang (David)
> >
> > [quoted exchange with Sage Weil snipped; see earlier in this thread]
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Z Zhang
Got your point. It is not only about the object data itself, but also ceph 
internal metadata.

The best option seems to be your PR and the wip-newstore-frags branch. :-)


Thanks.
Zhi Zhang (David)

> Date: Tue, 20 Oct 2015 06:25:43 -0700
> From: s...@newdream.net
> To: zhangz.da...@outlook.com
> CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> 
> On Tue, 20 Oct 2015, Z Zhang wrote:
> > Thanks, Sage, for pointing out the PR and ceph branch. I will take a 
> > closer look.
> > 
> > Yes, I am trying the KVStore backend. The reason we are trying it is that 
> > a few users don't have such a high requirement on occasional data loss. 
> > It seems the KVStore backend without a synchronized WAL can achieve better 
> > performance than filestore. And only data still in the page cache would get 
> > lost on a machine crash, not a process crash, if we use the WAL but no 
> > synchronization. What do you think?
> 
> That sounds dangerous.  The OSDs are recording internal metadata about the 
> cluster (peering, replication, etc.)... even if you don't care so much 
> about recent user data writes you probably don't want to risk breaking 
> RADOS itself.  If the kv backend is giving you a stale point-in-time 
> consistent copy it's not so bad, but in a power-loss event it could give 
> you problems...
> 
> sage
> 
> > 
> > Thanks. Zhi Zhang (David)
> > 
> > Date: Tue, 20 Oct 2015 05:47:44 -0700
> > From: s...@newdream.net
> > To: zhangz.da...@outlook.com
> > CC: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> > 
> > On Tue, 20 Oct 2015, Z Zhang wrote:
> > > Hi Guys,
> > > 
> > > I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with 
> > > rocksdb 3.11 as OSD backend. I use rbd to test performance and following 
> > > is my cluster info.
> > > 
> > > [ceph@xxx ~]$ ceph -s
> > > cluster b74f3944-d77f-4401-a531-fa5282995808
> > >  health HEALTH_OK
> > >  monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
> > > election epoch 1, quorum 0 xxx
> > >  osdmap e338: 44 osds: 44 up, 44 in
> > > flags sortbitwise
> > >   pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
> > > 1940 MB used, 81930 GB / 81932 GB avail
> > > 2048 active+clean
> > > 
> > > All the disks are spinning ones with write cache turned on. Rocksdb's 
> > > WAL and sst files are on the same disk as every OSD.
> >  
> > Are you using the KeyValueStore backend?
> >  
> > > Using fio to generate following write load: 
> > > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K 
> > > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  
> > > 
> > > Test result:
> > > WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
> > > WAL enabled + sync: true (default) + disk write cache: on|off  will get 
> > > only ~25 IOPS.
> > > 
> > > I tuned some other rocksdb options, but with no luck.
> >  
> > The wip-newstore-frags branch sets some defaults for rocksdb that I think 
> > look pretty reasonable (at least given how newstore is using rocksdb).
> >  
> > > I tracked down the rocksdb code and found each writer's Sync operation 
> > > would take ~30ms to finish. And as shown above, it is strange that 
> > > performance shows little difference no matter whether the disk write cache is on or 
> > > off.
> > > 
> > > Have you guys encountered a similar issue? Or did I miss something to 
> > > cause rocksdb's poor write performance?
> >  
> > Yes, I saw the same thing.  This PR addresses the problem and is nearing 
> > merge upstream:
> >  
> > https://github.com/facebook/rocksdb/pull/746
> >  
> > There is also an XFS performance bug that is contributing to the problem, 
> > but it looks like Dave Chinner just put together a fix for that.
> >  
> > But... we likely won't be using KeyValueStore in its current form over 
> > rocksdb (or any other kv backend).  It stripes object data over key/value 
> > pairs, which IMO is not the best approach.
> >  
> > sage
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >   
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Wido den Hollander
Hi,

In the "newstore direction" thread on ceph-devel I wrote that I'm using
bcache in production and Mark Nelson asked me to share some details.

Bcache is running in two clusters now that I manage, but I'll keep this
information to one of them (the one at PCextreme behind CloudStack).

This cluster has been running for over 2 years now:

epoch 284353
fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
created 2013-09-23 11:06:11.819520
modified 2015-10-20 15:27:48.734213

The system consists out of 39 hosts:

2U SuperMicro chassis:
* 80GB Intel SSD for OS
* 240GB Intel S3700 SSD for Journaling + Bcache
* 6x 3TB disk

This isn't the newest hardware. The next batch of hardware will have more
disks per chassis, but this is it for now.

All systems were installed with Ubuntu 12.04, but they are all running
14.04 now with bcache.

The Intel S3700 SSD is partitioned with a GPT label:
- 5GB Journal for each OSD
- 200GB Partition for bcache
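
For reference, a sketch of how one such bcache device is assembled (device
names and the cache-set UUID are placeholders):

make-bcache -C /dev/sda3                  # 200GB SSD partition as cache set
make-bcache -B /dev/sdb                   # one 3TB disk as backing device
echo /dev/sdb > /sys/fs/bcache/register   # normally handled by udev
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
mkfs.xfs /dev/bcache0                     # then used as the OSD data disk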

root@ceph11:~# df -h|grep osd
/dev/bcache0   2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
/dev/bcache1   2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
/dev/bcache2   2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
/dev/bcache3   2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
/dev/bcache4   2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
/dev/bcache5   2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
root@ceph11:~#

root@ceph11:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.3 LTS
Release:14.04
Codename:   trusty
root@ceph11:~# uname -r
3.19.0-30-generic
root@ceph11:~#

"apply_latency": {
"avgcount": 2985023,
"sum": 226219.891559000
}

What did we notice?
- Fewer spikes on the disks
- Lower commit latencies on the OSDs
- Almost no 'slow requests' during backfills
- Cache-hit ratio of about 60%

Max backfills and recovery active are both set to 1 on all OSDs.

For the next generation hardware we are looking into using 3U chassis
with 16 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't
tested those yet, so nothing to say about it.

The current setup is 200GB of cache for 18TB of disks (about 1:90). The new
setup will be 1200GB for 64TB (about 1:53); curious to see what that does.

Our main conclusion however is that it does smooth the I/O pattern
towards the disks, and that gives an overall better response from the disks.

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement rule not resolved

2015-10-20 Thread ghislain.chevalier
Hi Robert,

Sorry for replying late.

We finally used a 'step take' at the root on the production platform,
even though I tested a rule on the sandbox platform with a 'step take' at a
non-root level ... and it worked.

Brgds

-Original Message-
From: Robert LeBlanc [mailto:rob...@leblancnet.us] 
Sent: Tuesday, 6 October 2015 17:55
To: CHEVALIER Ghislain IMT/OLPS
Cc: ceph-users
Subject: Re: [ceph-users] Placement rule not resolved

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've only done a 'step take <bucket>' where <bucket> is a root entry. I haven't tried it 
with a bucket under the root. I would suspect it would work, but you can try to put 
your tiers in a root section and test it there.
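
For illustration, a rule of that shape, taking a dedicated SSD root and
spreading replicas across hosts, would look something like this in the
decompiled crushmap (the bucket name is from your map, the rule number is an
assumption):

rule ssd {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        step take tier-ssd
        step chooseleaf firstn 0 type host
        step emit
}

You can also dry-run a rule offline before injecting the map:

crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 4 --num-rep 3 --show-mappings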
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 6:18 AM, ghislain.cheval...@orange.com wrote:
> Hi,
>
>
>
> Context:
>
> Firefly 0.80.9
>
> 8 storage nodes
>
> 176 osds : 14*8 sas and 8*8 ssd
>
> 3 monitors
>
>
>
> I created an alternate crushmap in order to fulfill a tiering requirement, i.e.
> select ssd or sas.
>
> I created specific buckets “host-ssd” and “host-sas” and regrouped them 
> in “tier-ssd” and “tier-sas” under a “tier-root”.
>
> E.g. I want to select 1 ssd in 3 distinct hosts.
>
>
>
> I don’t understand why the placement rule for sas is working and not 
> for ssd.
>
> SAS drives are selected even if, according to the crushmap, they are not in 
> the right tree.
>
> When 3 ssd are sometimes selected, the pgs stay stuck but active.
>
>
>
> I attached the crushmap and ceph osd tree.
>
>
>
> Can someone have a look and tell me where the fault is?
>
>
>
> Bgrds
>
> - - - - - - - - - - - - - - - - -
> Ghislain Chevalier
> ORANGE/IMT/OLPS/ASE/DAPI/CSE
>
> Architecte de services d’infrastructure de stockage
>
> Sofware-Defined Storage Architect
> +33299124432
>
> +33788624370
> ghislain.cheval...@orange.com
>
> Think of the environment before printing this message!
>
>
>
> __
> ___
>
> This message and its attachments may contain confidential or 
> privileged information that may be protected by law; they should not 
> be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and 
> delete this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have 
> been modified, changed or falsified.
> Thank you.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWE+7gCRDmVDuy+mK58QAAdlYQAIKPgcewctAfPisSwvdl
iS60T15U2r2rnuh4G3AQjnmI0eb+nj9O7a1ZH7ttL1k3b5bZz/9/qjK+xnBe
z2UvTjdZltlWVkOSjyjRBpU4JWRS2wZXeMIqVcC71NHT4zGD0otQloftrPLA
ciQ73FDWOJgoA+PMca2oHO91IqQ+UZWr6BAs22scumTW9Zwb/E3QxZJyuh2F
3ajkvalXne97IIMO02ZFB+5PZgg34FukvcdJ/Z/eb+GCE1A57mkL9Wuazu46
u1NvatNWH13I0hZruR5ltWqLV8elTnFFd5KU5XWcyeewKbwbEzFUprUI5xlO
uLNN4vGPrcYYmx6Tm8wWEpSFoZrhOF8NHfIbjn3jM+ZAawzozh1WrMTwWXWG
a6hce307WuJVn/fvNY4IKOzUIwyh/OXPUq+R7RvvkGtnAGJn7aBjuUn6mg6x
AE60XWibRzPGsXvRebEeqEzsfuxbxdt+oml02LByoxei+IZScj446HuyiVqp
9skPJEQgEJL8TChs6+ctS6hkZmo9vJ9Ysk14fJSjXIvTV8eJb12LK9aNig7G
gXYxczfV9fjV/h4TKFcKRYddUj7g8tYpXb8ggJMtqP0B1Pi0gfrV5lsDVH6V
r77ZWisSJ9w+f6lMGzRJTnpeDualcolheBvyFKiqrjoEbxivnow9GFXl7WfT
GpUp
=4tHA
-END PGP SIGNATURE-

_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.

___

[ceph-users] pg incomplete state

2015-10-20 Thread John-Paul Robinson
Hi folks

I've been rebuilding drives in my cluster to add space.  This has gone
well so far.

After the last batch of rebuilds, I'm left with one placement group in
an incomplete state.

[sudo] password for jpr:
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 3.ea is stuck inactive since forever, current state incomplete, last
acting [30,11]
pg 3.ea is stuck unclean since forever, current state incomplete, last
acting [30,11]
pg 3.ea is incomplete, acting [30,11]

I've restarted both OSD a few times but it hasn't cleared the error.

On the primary I see errors in the log related to slow requests:

2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
3 included below; oldest blocked for > 31.922487 secs
2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
3.e4bd50ea) v4 currently reached pg

Notes online suggest this is an issue with the journal and that it may
be possible to export and rebuild the PG. I don't have firefly.

https://ceph.com/community/incomplete-pgs-oh-my/
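
For reference, the export/import flow that post describes needs a release
that ships ceph-objectstore-tool and the OSDs stopped; roughly (the paths
follow the ones below, the rest is a sketch):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --journal-path /var/lib/ceph/osd/ceph-11/journal \
    --pgid 3.ea --op export --file /tmp/pg3.ea.export

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --op import --file /tmp/pg3.ea.export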

Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
but missing entirely on osd.30 (the primary). 

on osd.30 (primary):

crowbar@da0-36-9f-0e-2b-88:~$ du -sk
/var/lib/ceph/osd/ceph-30/current/3.ea_head/
0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/

on osd.11 (secondary):

crowbar@da0-36-9f-0e-2b-40:~$ du -sh
/var/lib/ceph/osd/ceph-11/current/3.ea_head/

63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/

This makes some sense, since my disk drive rebuilding activity
reformatted the primary osd.30. It also gives me some hope that my data
is not lost.

I understand incomplete means a problem with the journal, but is there a way
to dig deeper into this, or is it possible to get the secondary's data to take
over?

Thanks,

~jpr



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v0.94.4 Hammer released upgrade

2015-10-20 Thread German Anders
Trying to upgrade from hammer 0.94.3 to 0.94.4, I'm getting the following
error msg while trying to restart the mon daemons ($ sudo restart
ceph-mon-all):

2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.506244 7fd971a858c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6847
2015-10-20 08:56:37.524079 7fd971a858c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.524101 7fd971a858c0 -1 error checking features: (1)
Operation not permitted

any ideas?

$ ceph -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)


Thanks in advance,

Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released upgrade

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, German Anders wrote:
> trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the following
> error msg while trying to restart the mon daemons ($ sudo restart
> ceph-mon-all):
> 
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={7=support shec erasure
> code}

That feature doesn't appear in hammer... it was merged just prior to 
infernalis.  It looks like you must have started ceph-mon with a 
newer version at some point?

> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)

Can you confirm 'ceph-mon -v' is also 0.94.4?
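
A quick way to rule out a stray newer binary or mixed packages on the mon
host (a sketch assuming a Debian/Ubuntu system, as the upstart command
suggests):

dpkg -l | grep ceph    # installed ceph package versions
which -a ceph-mon      # make sure only one ceph-mon binary is in the PATH
ceph-mon -v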

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] add new monitor doesn't update ceph.conf in hammer with ceph-deploy.

2015-10-20 Thread Stefan Eriksson
Hi

I’m using ceph-deploy with hammer and recently added a new monitor, I used this: 
http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-mon/ 

But it doesn’t say anything about adding conf manually to /etc/ceph/ceph.conf, 
so should we add the new monitor to either of:

mon_initial_members=
mon_host=

My monmap is showing the new monitor and it's active if I look at ceph -s, but I 
just want to know best practice: when using ceph-deploy to add a new monitor, 
should we manually add new entries to /etc/ceph/ceph.conf and push these out to 
the other monitors through ”ceph config push”?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released upgrade

2015-10-20 Thread German Anders
Yep also:

$ ceph-mon -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)



*German* 

2015-10-20 11:48 GMT-03:00 Sage Weil :

> On Tue, 20 Oct 2015, German Anders wrote:
> > trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the following
> > error msg while trying to restart the mon daemons ($ sudo restart
> > ceph-mon-all):
> >
> > 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> > (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> > 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
> > unsupported features: compat={},rocompat={},incompat={7=support shec
> erasure
> > code}
>
> That feature doesn't appear in hammer... it was merged just prior to
> infernalis.  It looks like you must have started ceph-mon with a
> newer version at some point?
>
> > $ ceph -v
> > ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
> Can you confirm 'ceph-mon -v' is also 0.94.4?
>
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add new monitor doesn't update ceph.conf in hammer with ceph-deploy.

2015-10-20 Thread LOPEZ Jean-Charles
Hi Stefan,

update the ceph.conf file on your ceph-deploy node (~/ceph-deploy/ceph.conf) 
and then push the updated config file to other machines in the cluster as well 
as clients (if your config file is generic between cluster nodes and client 
nodes). If client config file is different you’ll have to perform the update on 
the clients a different way.

Pushing the config from your ceph-deploy machine: ceph-deploy 
--overwrite-conf config push node1 node2 node3 ...

JC

> On Oct 20, 2015, at 07:54, Stefan Eriksson wrote:
> 
> Hi
> 
> I’m using ceph-deploy with hammer and recently added a new monitor, I used 
> this: http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-mon/ 
> 
> But it doesn’t say anything about adding conf manually to 
> /etc/ceph/ceph.conf, should we add the new monitor to either of:
> 
> mon_initial_members=
> mon_host=
> 
> my monmap is showing the new monitor and it's active if I look at ceph -s, but 
> I just want to know best practice: when using ceph-deploy to add a new 
> monitor, should we manually add new entries to /etc/ceph/ceph.conf and push 
> these out to the other monitors through ”ceph config push”?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add new monitor doesn't update ceph.conf in hammer with ceph-deploy.

2015-10-20 Thread LOPEZ Jean-Charles
And one thing I forgot:

Yes, update both lines with the new mon node's information:
mon_initial_members and mon_host
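
For anyone wondering, those lines look like this in practice (hostnames and
addresses here are hypothetical):

[global]
mon_initial_members = mona, monb, monc
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3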

JC

> On Oct 20, 2015, at 07:54, Stefan Eriksson wrote:
> 
> Hi
> 
> I’m using ceph-deploy with hammer and recently added a new monitor, I used 
> this: http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-mon/ 
> 
> But it doesn’t say anything about adding conf manually to 
> /etc/ceph/ceph.conf, should we add the new monitor to either of:
> 
> mon_initial_members=
> mon_host=
> 
> my monmap is showing the new monitor and it's active if I look at ceph -s, but 
> I just want to know best practice: when using ceph-deploy to add a new 
> monitor, should we manually add new entries to /etc/ceph/ceph.conf and push 
> these out to the other monitors through ”ceph config push”?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add new monitor doesn't update ceph.conf in hammer with ceph-deploy.

2015-10-20 Thread Stefan Eriksson
Thanks! I'll do that. Should I file a bug report to get this mentioned in the 
documentation?


Den 2015-10-20 kl. 17:25, skrev LOPEZ Jean-Charles:

And forgot.

Yes, update both lines with the new mon node information
mon_initial_members and mon_host

JC

On Oct 20, 2015, at 07:54, Stefan Eriksson wrote:


Hi

I’m using ceph-deploy with hammer and recently added a new monitor, I 
used this: 
http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-mon/
But it doesn’t say anything about adding conf manually to 
/etc/ceph/ceph.conf, should we add the new monitor to either of:


mon_initial_members=
mon_host=

my monmap is showing the new monitor and it's active if I look at ceph 
-s, but I just want to know best practice: when using ceph-deploy to 
add a new monitor, should we manually add new entries to 
/etc/ceph/ceph.conf and push these out to the other monitors through 
”ceph config push” ?

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released upgrade

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, German Anders wrote:
> Yep also:
> 
> $ ceph-mon -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)

Do you know how you had another version installed?

I pushed wip-mon-reset-features, which should let you override this... but 
I would figure out how it happened before moving on.

s
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-20 Thread Luis Periquito
>
> On 10/20/2015 08:41 AM, Robert LeBlanc wrote:
>>
>> Given enough load, that fast journal will get filled and you will only be
>> as fast as the backing disk can flush (and at the same time service reads).
>> That's the situation we are in right now. We are still seeing better
>> performance than a raw spindle, but only 150 IOPS, not the 15000 IOPS that
>> the SSD can do. You are still ultimately bound by the back end disk.
>>
>> Robert LeBlanc
>>

This is true. However, I've seen this happening in "enterprise-grade"
storage systems, where you have an amount of cache which is very quick
to write. When that runs out you drop to write-through mode,
or even worse into back-to-back mode (think sync write IO).

However, given the cluster size, the way ceph works, replication
factors, etc., the volume you need to write at once can be very big,
and it easily grows with more OSDs/nodes.

OTOH worst case you are exactly where you started: HDD performance.

Also, you can start doing "smart" stuff, like allowing small random IO
into the journal but coalescing it into big writes to the back end
filesystem. If there are any problems you can then just replay
the journal.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor Read Performance with Ubuntu 14.04 LTS 3.19.0-30 Kernel

2015-10-20 Thread Quentin Hartman
I performed this kernel upgrade (to 3.19.0-30) over the weekend on my
cluster, and my before / after benchmarks were very close to each other,
about 500MB/s each.
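
For anyone wanting to compare request sizes across kernels as Nick suggests
below, a sketch (the device name is an assumption):

iostat -x 1 /dev/sda                      # compare the avgrq-sz column (512-byte sectors)
cat /sys/block/sda/queue/read_ahead_kb    # current readahead, in KB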

On Tue, Oct 6, 2015 at 3:15 PM, Nick Fisk  wrote:

> I'm wondering if you are hitting the "bug" with the readahead changes?
>
> I know the changes to limit readahead to 2MB was introduced in 3.15, but I
> don't know if it was back ported into 3.13 or not. I have a feeling this
> may
> also limit maximum request size to 2MB as well.
>
> If you look in iostat do you see different request sizes between the two
> kernels?
>
> There is a 4.2 kernel with the readahead change reverted, it might be worth
> testing it.
>
>
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > MailingLists - EWS
> > Sent: 06 October 2015 18:12
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Poor Read Performance with Ubuntu 14.04 LTS
> > 3.19.0-30 Kernel
> >
> > > Hi,
> > >
> > > Very interesting!  Did you upgrade the kernel on both the OSDs and
> > > clients
> > or
> > > just some of them?  I remember there were some kernel performance
> > > regressions a little while back.  You might try running perf during
> > > your
> > tests
> > > and look for differences.  Also, iperf might be worth trying to see if
> > it's a
> > > network regression.
> > >
> > > I also have a script that compares output from sysctl which might be
> > > worth trying to see if any defaults changes.
> > >
> > > https://github.com/ceph/cbt/blob/master/tools/compare_sysctl.py
> > >
> > > basically just save systctl -a with both kernels and pass them as
> > arguments to
> > > the python script.
> > >
> > > Mark
> >
> > Mark,
> >
> > The testing was done with 3.19 on the client with 3.13 on the OSD nodes
> > using "rados bench -p bench 50 seq" with an initial "rados bench -p bench
> 50
> > write --no-cleanup". We suspected the network as well and tested with
> iperf
> > as one of our first steps and saw expected speeds (9.9Gb/s as we are
> using
> > bonded X540-T2 interfaces) on both kernels. As an added data point, we
> > have no problem with write performance to the same pool with the same
> > kernel configuration (~1GB/s). We also checked the values of
> > read_ahead_kb of the block devices but both were shown to be the default
> > of 128 (we have since changed these to 4096 in our configuration, but the
> > results were seen with the default of 128).
> >
> > We are in the process of rebuilding the entire cluster to use 3.13 and a
> > completely fresh installation of Ceph to make sure nothing else is at
> play
> > here.
> >
> > We did check a few things in iostat and collectl, but we didn't see any
> read IO
> > against the OSDs, so I am leaning towards something further up the stack.
> >
> > Just a little more background on the cluster configuration:
> >
> > Specific pool created just for benchmarking, using 512 pgs and pgps and 2
> > replicas. Using 3 OSD nodes (also handling MON duties) with 8 SATA 7.2K
> > RPM OSDs and 2 NVMe journals (4 OSD to 1 Journal ratio). 1 x Hex core
> CPUs
> > with 32GB of RAM per OSD node.
> >
> > Tom
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Mark Nelson

On 10/20/2015 09:00 AM, Wido den Hollander wrote:

Hi,

In the "newstore direction" thread on ceph-devel I wrote that I'm using
bcache in production and Mark Nelson asked me to share some details.

Bcache is running in two clusters now that I manage, but I'll keep this
information to one of them (the one at PCextreme behind CloudStack).

This cluster has been running for over 2 years now:

epoch 284353
fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
created 2013-09-23 11:06:11.819520
modified 2015-10-20 15:27:48.734213

The system consists of 39 hosts:

2U SuperMicro chassis:
* 80GB Intel SSD for OS
* 240GB Intel S3700 SSD for Journaling + Bcache
* 6x 3TB disk

This isn't the newest hardware. The next batch of hardware will be more
disks per chassis, but this is it for now.

All systems were installed with Ubuntu 12.04, but they are all running
14.04 now with bcache.

The Intel S3700 SSD is partitioned with a GPT label:
- 5GB Journal for each OSD
- 200GB Partition for bcache

root@ceph11:~# df -h|grep osd
/dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
root@ceph11:~#

root@ceph11:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.3 LTS
Release:        14.04
Codename:       trusty
root@ceph11:~# uname -r
3.19.0-30-generic
root@ceph11:~#

"apply_latency": {
 "avgcount": 2985023,
 "sum": 226219.891559000
}

What did we notice?
- Less spikes on the disk
- Lower commit latencies on the OSDs
- Almost no 'slow requests' during backfills
- Cache-hit ratio of about 60%

Max backfills and recovery active are both set to 1 on all OSDs.

For the next generation hardware we are looking into using 3U chassis
with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
tested those yet, so nothing to say about it.

The current setup is 200GB of cache for 18TB of disks. The new setup
will be 1200GB for 64TB, curious to see what that does.

Our main conclusion however is that it does smooth out the I/O pattern
towards the disks, and that gives an overall better response from the disks.


Hi Wido, thanks for the big writeup!  Did you guys happen to do any 
benchmarking?  I think Xiaoxi looked at flashcache a while back but had 
mixed results if I remember right.  It would be interesting to know how 
bcache is affecting performance in different scenarios.


Thanks,
Mark



Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minimum failure domain

2015-10-20 Thread J David
On Mon, Oct 19, 2015 at 7:09 PM, John Wilkins  wrote:
> The classic case is when you are just trying Ceph out on a laptop (e.g.,
> using file directories for OSDs, setting the replica size to 2, and setting
> osd_crush_chooseleaf_type to 0).

Sure, but the text isn’t really applicable in that situation, is it?
It’s specifically calling out the SSD as a single point of failure
when it’s being used to journal multiple OSDs, like that’s an
important consideration in determining the minimum failure domain.
Which, for single-node testing, the minimum failure domain ship has
already pretty much sailed, and on any non-single-node deployment,
testing or otherwise, a node is realistically already the minimum
failure domain. (And isn’t it the default anyway?)

Likewise, if you’re doing a single-node test with a bunch of OSDs on
one drive, that drive is already a shared failure component, whether
or not journalling is being done to a separate SSD.

> The statement is a guideline. You could, in fact, create a CRUSH hierachy
> consisting of OSD/journal groups within a host too. However, capturing the
> host as a failure domain is preferred if you need to power down the host to
> change a drive (assuming it’s not hot-swappable).

The particular example given is of a single SSD for the entire node.
Inside a given host/node, there are all sorts of single points of
failure.

> There are cases with high density systems where you have multiple nodes in
> the same chassis. So you might opt for a higher minimum failure domain in a
> case like that.

Sure, my question was a bit unclear in that regard.  There are plenty
of cases where you the minimum failure domain might be *larger* than a
node (and you identified several good ones).  Mainly I meant to ask
under what circumstances the minimum failure domain might be *smaller*
than a node.  The only valid answer to that appears to be “testing.”

In light of that, perhaps the text as written emphasizes the
minimum failure domain unnecessarily, applicable as that is only to
testing, and only to a very specific hardware configuration that
(probably) isn’t very common in testing.  (And, when it is, the
realities of the testing environment where it can come up essentially
require going against the advice given anyway.)

Perhaps the text would be of more benefit to a larger group of readers
if that callout instead reflected the other practical considerations
of packing multiple journals on one SSD: namely that your cluster must
be designed to withstand the simultaneous failure of all OSDs that
journal to that device, both in terms of excess capacity and
rebalancing throughput.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Udo Lembke
Hi,
have you changed the ownership as described in Sage's mail about
"v9.1.0 Infernalis release candidate released"?

  #. Fix the ownership::

   chown -R ceph:ceph /var/lib/ceph

or set ceph.conf to use root instead?
  When upgrading, administrators have two options:

   #. Add the following line to ``ceph.conf`` on all hosts::

setuser match path = /var/lib/ceph/$type/$cluster-$id

  This will make the Ceph daemons run as root (i.e., not drop
  privileges and switch to user ceph) if the daemon's data
  directory is still owned by root.  Newly deployed daemons will
  be created with data owned by user ceph and will run with
  reduced privileges, but upgraded daemons will continue to run as
  root.



Udo

On 20.10.2015 14:59, German Anders wrote:
> trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the
> following error msg while trying to restart the mon daemons:
>
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features:
> (1) Operation not permitted
> 2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
> 2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features:
> (1) Operation not permitted
>
>
> any ideas?
>
> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
>
> Thanks in advance,
>
> Cheers,
>
>
> *German*
>
> 2015-10-19 18:07 GMT-03:00 Sage Weil:
>
> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to
> Infernalis. That is, all users of earlier version of Hammer or any
> version of Firefly will first need to upgrade to hammer v0.94.4 or
> later before upgrading to Infernalis (or future releases).
>
> All v0.94.x Hammer users are strongly encouraged to upgrade.
>
> Changes
> ---
>
> * build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong
> (#12166, Nathan Cutler)
> * build/ops: ceph.spec.in: ceph-common needs python-argparse on older
> distros, but doesn't require it (#12034, Nathan Cutler)
> * build/ops: ceph.spec.in: radosgw requires apache for SUSE only --
> makes no sense (#12358, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
> (#11991, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: not possible to turn off Java
> (#11992, Owen Synge)
> * build/ops: ceph.spec.in: running fdupes unnecessarily (#12301,
> Nathan Cutler)
> * build/ops: ceph.spec.in: snappy-devel for all supported distros
> (#12361, Nathan Cutler)
> * build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
> (#11629, Nathan Cutler)
> * build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3
> build (#12351, Nathan Cutler)
> * build/ops: error in ext_mime_map_init() when /etc/mime.types is
> missing (#11864, Ken Dreyer)
> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5
> in 30s) (#11798, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#10927, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11140, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11686, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#12407, Sage Weil)
> * cli: ceph: cli throws exception on unrecognized errno (#11354,
> Kefu Chai)
> * cli: ceph tell: broken error message / misleading hinting
> (#11101, Kefu Chai)
> * common: arm: all programs that link to librados2 hang forever on
> startup (#12505, Boris Ranto)
> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer
> objects (#13070, Sage Weil)
> * common: do not insert emtpy ptr when rebuild emtpy bufferlist
> (#12775, Xinze Chi)
> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
> * common: Memory lea

Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Stefan Eriksson
A change like this below, where we have to change ownership, was not added 
to a point release for hammer, right?



Den 2015-10-20 kl. 20:06, skrev Udo Lembke:

Hi,
have you changed the ownership as described in Sage's mail about 
"v9.1.0 Infernalis release candidate released"?

   #. Fix the ownership::

   chown -R ceph:ceph /var/lib/ceph

or set ceph.conf to use root instead?
   When upgrading, administrators have two options:

#. Add the following line to ``ceph.conf`` on all hosts::

 setuser match path = /var/lib/ceph/$type/$cluster-$id

   This will make the Ceph daemons run as root (i.e., not drop
   privileges and switch to user ceph) if the daemon's data
   directory is still owned by root.  Newly deployed daemons will
   be created with data owned by user ceph and will run with
   reduced privileges, but upgraded daemons will continue to run as
   root.


Udo

On 20.10.2015 14:59, German Anders wrote:
trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the 
following error msg while trying to restart the mon daemons:


2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data 
includes unsupported features: 
compat={},rocompat={},incompat={7=support shec erasure code}
2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: 
(1) Operation not permitted
2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data 
includes unsupported features: 
compat={},rocompat={},incompat={7=support shec erasure code}
2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: 
(1) Operation not permitted



any ideas?

$ ceph -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)


Thanks in advance,

Cheers,


*German*

2015-10-19 18:07 GMT-03:00 Sage Weil:


This Hammer point fixes several important bugs in Hammer, as well as
fixing interoperability issues that are required before an upgrade to
Infernalis. That is, all users of earlier version of Hammer or any
version of Firefly will first need to upgrade to hammer v0.94.4 or
later before upgrading to Infernalis (or future releases).

All v0.94.x Hammer users are strongly encouraged to upgrade.

Changes
---

* build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong
(#12166, Nathan Cutler)
* build/ops: ceph.spec.in: ceph-common needs python-argparse on older
distros, but doesn't require it (#12034, Nathan Cutler)
* build/ops: ceph.spec.in: radosgw requires apache for SUSE only --
makes no sense (#12358, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
(#11991, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: not possible to turn off Java
(#11992, Owen Synge)
* build/ops: ceph.spec.in: running fdupes unnecessarily
(#12301, Nathan Cutler)
* build/ops: ceph.spec.in: snappy-devel for all supported distros
(#12361, Nathan Cutler)
* build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
(#11629, Nathan Cutler)
* build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build
(#12351, Nathan Cutler)
* build/ops: error in ext_mime_map_init() when /etc/mime.types is
missing (#11864, Ken Dreyer)
* build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5
in 30s) (#11798, Sage Weil)
* build/ops: With root as default user, unable to have multiple
RGW instances running (#10927, Sage Weil)
* build/ops: With root as default user, unable to have multiple
RGW instances running (#11140, Sage Weil)
* build/ops: With root as default user, unable to have multiple
RGW instances running (#11686, Sage Weil)
* build/ops: With root as default user, unable to have multiple
RGW instances running (#12407, Sage Weil)
* cli: ceph: cli throws exception on unrecognized errno (#11354,
Kefu Chai)
* cli: ceph tell: broken error message / misleading hinting
(#11101, Kefu Chai)
* common: arm: all programs that link to librados2 hang forever
on startup (#12505, Boris Ranto)
* common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
* common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer
objects (#13070, Sage Weil)
* common: do not insert emtpy ptr when rebuild emtpy bufferlist
(#12775, Xinze Chi)
* common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason
Dillaman)
* common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
* comm

Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread German Anders
Hi Udo,

I've tried that, and no luck at all.

Cheers,


*German* 

2015-10-20 15:06 GMT-03:00 Udo Lembke :

> Hi,
> have you changed the ownership as described in Sage's mail about
> "v9.1.0 Infernalis release candidate released"?
>
>   #. Fix the ownership::
>
>  chown -R ceph:ceph /var/lib/ceph
>
> or set ceph.conf to use root instead?
>   When upgrading, administrators have two options:
>
>#. Add the following line to ``ceph.conf`` on all hosts::
>
> setuser match path = /var/lib/ceph/$type/$cluster-$id
>
>   This will make the Ceph daemons run as root (i.e., not drop
>   privileges and switch to user ceph) if the daemon's data
>   directory is still owned by root.  Newly deployed daemons will
>   be created with data owned by user ceph and will run with
>   reduced privileges, but upgraded daemons will continue to run as
>   root.
>
>
>
> Udo
>
> On 20.10.2015 14:59, German Anders wrote:
>
> trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the following
> error msg while trying to restart the mon daemons:
>
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={7=support shec
> erasure code}
> 2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
> Operation not permitted
> 2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
> 2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={7=support shec
> erasure code}
> 2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
> Operation not permitted
>
>
> any ideas?
>
> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
>
> Thanks in advance,
>
> Cheers,
>
> *German*
>
> 2015-10-19 18:07 GMT-03:00 Sage Weil :
>
>> This Hammer point fixes several important bugs in Hammer, as well as
>> fixing interoperability issues that are required before an upgrade to
>> Infernalis. That is, all users of earlier version of Hammer or any
>> version of Firefly will first need to upgrade to hammer v0.94.4 or
>> later before upgrading to Infernalis (or future releases).
>>
>> All v0.94.x Hammer users are strongly encouraged to upgrade.
>>
>> Changes
>> ---
>>
>> * build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166,
>> Nathan Cutler)
>> * build/ops: ceph.spec.in: ceph-common needs python-argparse on older
>> distros, but doesn't require it (#12034, Nathan Cutler)
>> * build/ops: ceph.spec.in: radosgw requires apache for SUSE only --
>> makes no sense (#12358, Nathan Cutler)
>> * build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
>> (#11991, Nathan Cutler)
>> * build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992,
>> Owen Synge)
>> * build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan
>> Cutler)
>> * build/ops: ceph.spec.in: snappy-devel for all supported distros
>> (#12361, Nathan Cutler)
>> * build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
>> (#11629, Nathan Cutler)
>> * build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build
>> (#12351, Nathan Cutler)
>> * build/ops: error in ext_mime_map_init() when /etc/mime.types is missing
>> (#11864, Ken Dreyer)
>> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
>> (#11798, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#10927, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#11140, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#11686, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#12407, Sage Weil)
>> * cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu
>> Chai)
>> * cli: ceph tell: broken error message / misleading hinting (#11101, Kefu
>> Chai)
>> * common: arm: all programs that link to librados2 hang forever on
>> startup (#12505, Boris Ranto)
>> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
>> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects
>> (#13070, Sage Weil)
>> * common: do not insert emtpy ptr when rebuild emtpy bufferlist (#12775,
>> Xinze Chi)
>> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
>> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
>> * common: Memory leak in Mutex.cc, pthread_mutexattr_init without
>> pthread_mutexattr_destroy (#11762, Ketor Meng)
>> * common: object_map_update fails with -EINVAL return code (#12611, Jason
>> Dillaman)
>> * common: Pipe: Drop connect_seq increase line

Re: [ceph-users] rbd export hangs / does nothing without regular drop_cache

2015-10-20 Thread Stefan Priebe


On 20.10.2015 at 15:03, Jason Dillaman wrote:

Can you provide more details on your setup and how you are running the rbd 
export?


System with Raid 50 and 50TB space.

Just running rbd export ... from command line. I'm exporting to a btrfs 
volume.


Stefan

>  If clearing the pagecache, dentries, and inodes solves the issue, it 
sounds like it's outside of Ceph (unless you are exporting to a CephFS 
or krbd mount point).
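
For reference, the periodic cache drop referred to in the subject is
presumably the usual sysctl knob; the exact command isn't shown, so this is
a sketch:

sync
echo 3 > /proc/sys/vm/drop_caches    # 1 = pagecache, 2 = dentries+inodes, 3 = both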



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph and upgrading OS version

2015-10-20 Thread Andrei Mikhailovsky
Hello everyone 

I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
wondering if you have a recommended process of upgrading the OS version without 
causing any issues to the ceph cluster? 

Many thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Martin Millnert
The thing that worries me with your next-gen design (actually your current 
design as well) is SSD wear. If you use an Intel SSD rated at 10 DWPD, that's 
1.2TB x 10 = 12TB/day of write endurance per 64TB total. I guess it's use-case 
dependent, and perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.
You're also throughput-limiting yourself to the PCIe bandwidth of the NVMe device 
(regardless of NVRAM/SSD). Compared to a traditional interface, that may be OK of 
course in relative terms. NVRAM vs SSD here is simply a choice between wear 
(NVRAM as journal minimum) and cache hit probability (size).
Interesting thought experiment anyway for me, thanks for sharing Wido.
/M

-------- Original message --------
From: Wido den Hollander  
Date: 20/10/2015  16:00  (GMT+01:00) 
To: ceph-users  
Subject: [ceph-users] Ceph OSDs with bcache experience 

Hi,

In the "newstore direction" thread on ceph-devel I wrote that I'm using
bcache in production and Mark Nelson asked me to share some details.

Bcache is running in two clusters now that I manage, but I'll keep this
information to one of them (the one at PCextreme behind CloudStack).

In this cluster has been running for over 2 years now:

epoch 284353
fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
created 2013-09-23 11:06:11.819520
modified 2015-10-20 15:27:48.734213

The system consists out of 39 hosts:

2U SuperMicro chassis:
* 80GB Intel SSD for OS
* 240GB Intel S3700 SSD for Journaling + Bcache
* 6x 3TB disk

This isn't the newest hardware. The next batch of hardware will be more
disks per chassis, but this is it for now.

All systems were installed with Ubuntu 12.04, but they are all running
14.04 now with bcache.

The Intel S3700 SSD is partitioned with a GPT label:
- 5GB Journal for each OSD
- 200GB Partition for bcache

root@ceph11:~# df -h|grep osd
/dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
root@ceph11:~#

root@ceph11:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.3 LTS
Release:14.04
Codename:   trusty
root@ceph11:~# uname -r
3.19.0-30-generic
root@ceph11:~#

"apply_latency": {
    "avgcount": 2985023,
    "sum": 226219.891559000
}

What did we notice?
- Less spikes on the disk
- Lower commit latencies on the OSDs
- Almost no 'slow requests' during backfills
- Cache-hit ratio of about 60%

Max backfills and recovery active are both set to 1 on all OSDs.

For the next generation hardware we are looking into using 3U chassis
with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
tested those yet, so nothing to say about it.

The current setup is 200GB of cache for 18TB of disks. The new setup
will be 1200GB for 64TB, curious to see what that does.

Our main conclusion however is that it does smoothen the I/O-pattern
towards the disks and that gives a overall better response of the disks.

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Francois Lafont
Hi,

On 20/10/2015 20:11, Stefan Eriksson wrote:

> A change like this below, where we have to change ownership, was not added to a 
> point release for hammer, right?

Right. ;)

I have upgraded my ceph cluster from 0.94.3 to 0.94.4 today without any problem.
The daemons used the root account in 0.94.3 and still use it in 0.94.4. I have
not changed the ownership of /var/lib/ceph/ at all for this upgrade.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Martin Millnert
OK - it seems my android email client (native samsung) messed up
"in-reply-to", which confuses some MUAs. Apologies for that (& this).

/M

On Tue, Oct 20, 2015 at 09:45:25PM +0200, Martin Millnert wrote:
> The thing that worries me with your next-gen design (actually your current
> design as well) is SSD wear. If you use an Intel SSD rated at 10 DWPD, that's
> 1.2TB x 10 = 12TB/day of write endurance per 64TB total. I guess it's use-case
> dependent, and perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.
> 
> You're also throughput-limiting yourself to the PCIe bandwidth of the NVMe device
> (regardless of NVRAM/SSD). Compared to a traditional interface, that may be OK of
> course in relative terms. NVRAM vs SSD here is simply a choice between wear
> (NVRAM as journal minimum) and cache hit probability (size).
> 
> Interesting thought experiment anyway for me, thanks for sharing Wido.
> 
> /M
> 
> 
> -------- Original message --------
> From: Wido den Hollander 
> Date: 20/10/2015 16:00 (GMT+01:00)
> To: ceph-users 
> Subject: [ceph-users] Ceph OSDs with bcache experience
> 
> Hi,
> 
> In the "newstore direction" thread on ceph-devel I wrote that I'm using
> bcache in production and Mark Nelson asked me to share some details.
> 
> Bcache is running in two clusters now that I manage, but I'll keep this
> information to one of them (the one at PCextreme behind CloudStack).
> 
> This cluster has been running for over 2 years now:
> 
> epoch 284353
> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
> created 2013-09-23 11:06:11.819520
> modified 2015-10-20 15:27:48.734213
> 
> The system consists of 39 hosts:
> 
> 2U SuperMicro chassis:
> * 80GB Intel SSD for OS
> * 240GB Intel S3700 SSD for Journaling + Bcache
> * 6x 3TB disk
> 
> This isn't the newest hardware. The next batch of hardware will be more
> disks per chassis, but this is it for now.
> 
> All systems were installed with Ubuntu 12.04, but they are all running
> 14.04 now with bcache.
> 
> The Intel S3700 SSD is partitioned with a GPT label:
> - 5GB Journal for each OSD
> - 200GB Partition for bcache
> 
> root@ceph11:~# df -h|grep osd
> /dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
> /dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
> /dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
> /dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
> /dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
> /dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
> root@ceph11:~#
> 
> root@ceph11:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 14.04.3 LTS
> Release: 14.04
> Codename: trusty
> root@ceph11:~# uname -r
> 3.19.0-30-generic
> root@ceph11:~#
> 
> "apply_latency": {
> "avgcount": 2985023,
> "sum": 226219.891559000
> }
> 
> What did we notice?
> - Less spikes on the disk
> - Lower commit latencies on the OSDs
> - Almost no 'slow requests' during backfills
> - Cache-hit ratio of about 60%
> 
> Max backfills and recovery active are both set to 1 on all OSDs.
> 
> For the next generation hardware we are looking into using 3U chassis
> with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
> tested those yet, so nothing to say about it.
> 
> The current setup is 200GB of cache for 18TB of disks. The new setup
> will be 1200GB for 64TB, curious to see what that does.
> 
> Our main conclusion however is that it does smooth out the I/O pattern
> towards the disks, and that gives an overall better response from the disks.
> 
> Wido
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Andrei Mikhailovsky
Same here, the upgrade went well. So far so good. 



- Original Message -

From: "Francois Lafont"  
To: "ceph-users"  
Sent: Tuesday, 20 October, 2015 9:14:43 PM 
Subject: Re: [ceph-users] v0.94.4 Hammer released 

Hi, 

On 20/10/2015 20:11, Stefan Eriksson wrote: 

> A change like this below, where we have to change ownership, was not added to a 
> point release for hammer, right? 

Right. ;) 

I have upgraded my ceph cluster from 0.94.3 to 0.94.4 today without any 
problem. 
The daemons used the root account in 0.94.3 and still use it in 0.94.4. I have 
not changed the ownership of /var/lib/ceph/ at all for this upgrade. 

-- 
François Lafont 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Lindsay Mathieson
On 21 October 2015 at 08:09, Andrei Mikhailovsky  wrote:

> Same here, the upgrade went well. So far so good.
>

Ditto


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How ceph client abort IO

2015-10-20 Thread min fang
I want to abort and retry an IO if it is taking a long time and has not
completed. Does this make sense in Ceph? How does the ceph client handle IOs
that take too long? Does it just wait until they return, or can some other
error recovery method be used to handle an IO that is not answered in time?

Thanks.

2015-10-20 21:00 GMT+08:00 Jason Dillaman :

> There is no such interface currently on the librados / OSD side to abort
> IO operations.  Can you provide some background on your use-case for
> aborting in-flight IOs?
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
> > From: "min fang" 
> > To: ceph-users@lists.ceph.com
> > Sent: Monday, October 19, 2015 6:41:40 PM
> > Subject: [ceph-users] How ceph client abort IO
>
> > Can the librbd interface provide an abort API for aborting IO? If yes, can
> > the abort interface detach the write buffer immediately? I hope to reuse
> > the write buffer quickly after issuing the abort request, without waiting
> > for the IO to be aborted on the OSD side.
>
> > thanks.
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-20 Thread hzwuli...@gmail.com
Hi, 

Thanks for you reply.

I did more tests here and things got stranger; now I can only get about 
4k iops in the VM:
1. use fio with ioengine rbd to test the volume on the real machine
[global]
ioengine=rbd
clientname=admin
pool=vol_ssd
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f
rw=randwrite
bs=4k
group_reporting=1

[rbd_iodepth32]
iodepth=32
[rbd_iodepth1]
iodepth=32
[rbd_iodepth28]
iodepth=32
[rbd_iodepth8]
iodepth=32

could achieve about 18k iops.

2. test the same volume in the VM, achieve about 4.3k iops
[global]
rw=randwrite
bs=4k
ioengine=libaio
#ioengine=sync
iodepth=128
direct=1
group_reporting=1
thread=1
filename=/dev/vdb

[task1]
iodepth=32
[task2]
iodepth=32
[task3]
iodepth=32
[task4]
iodepth=32

Using ceph osd perf to check the osd latency: all less than 1 ms.
Using iostat to check the osd %util: about 10% during the case 2 test.
Using dstat to check the VM status:
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  2   4  51  43   0   0|   017M| 997B 3733B|   0 0 |3476  6997 
  2   5  51  43   0   0|   018M| 714B 4335B|   0 0 |3439  6915 
  2   5  50  43   0   0|   017M| 594B 3150B|   0 0 |3294  6617 
  1   3  52  44   0   0|   018M| 648B 3726B|   0 0 |3447  6991 
  1   5  51  43   0   0|   018M| 582B 3208B|   0 0 |3467  7061 

Finally, using iptraf to check the packet sizes in the VM: almost all packets
are around 1 to 70 and 71 to 140 bytes. That's different from the real machine.

But maybe iptraf in the VM can't prove anything, so I checked the real machine 
the VM is located on. 
Nothing seems abnormal.

BTW, my VM is located on the ceph storage node.

Can anyone give me more suggestions?

Thanks!



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-20 19:36
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi,
 
I'm able to reach around the same performance with qemu-librbd vs qemu-krbd,
when I compile qemu with jemalloc
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363)
 
In my tests, librbd with jemalloc still uses 2x more cpu than krbd,
so cpu could be the bottleneck too.
 
With fast cpus (3.1GHz), I'm able to reach around 70k 4k iops with an rbd volume, 
with both krbd and librbd
 
 
- Mail original -
De: hzwuli...@gmail.com
À: "ceph-users" 
Envoyé: Mardi 20 Octobre 2015 10:22:33
Objet: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd to create a 100G volume from the ssd pool and map it to the real 
machine 
3. volume2: use cinder to create a 100G volume from the ssd pool and attach it to a 
guest host 
4. disable rbd cache 
5. fio test on the two volumes: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 
 
volume1 got about 24k IOPS and volume2 got about 14k IOPS. 
 
We could see the performance of volume2 is not good compared to volume1, so is it 
normal behavior for a guest host? 
If not, what maybe the problem? 
 
Thanks! 
 
hzwuli...@gmail.com 
 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-20 Thread Alexandre DERUMIER
Damn, that's a huge difference.

What is your host os, guest os, qemu version and vm config?



As an extra boost, you could enable an iothread on the virtio disk.
(It's available in libvirt but not in openstack yet.)

If it's a test server, maybe you could test it with the proxmox 4.0 hypervisor
https://www.proxmox.com

I have made a lot of patches inside it to optimize rbd (qemu+jemalloc, 
iothreads,...)
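
For reference, wiring an iothread to a virtio disk in a libvirt domain
definition looks roughly like this (a sketch assuming libvirt >= 1.2.8;
unrelated elements omitted):

<domain type='kvm'>
  <iothreads>1</iothreads>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
      <!-- source/target elements omitted -->
    </disk>
  </devices>
</domain>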


- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 06:11:20
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 

Thanks for your reply.

I did more tests here and things got even stranger; now I can only get about
4k IOPS in the VM:
1. use fio with the rbd ioengine to test the volume on the real machine:
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 

[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 

This achieves about 18k IOPS.

2. test the same volume in the VM, achieving about 4.3k IOPS:
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 

[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 

Using ceph osd perf to check the OSD latency: all less than 1 ms.
Using iostat to check the OSD %util: about 10% during the case 2 test.
Using dstat to check VM status:
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   4  51  43   0   0|   0    17M| 997B 3733B|   0     0 |3476  6997
  2   5  51  43   0   0|   0    18M| 714B 4335B|   0     0 |3439  6915
  2   5  50  43   0   0|   0    17M| 594B 3150B|   0     0 |3294  6617
  1   3  52  44   0   0|   0    18M| 648B 3726B|   0     0 |3447  6991
  1   5  51  43   0   0|   0    18M| 582B 3208B|   0     0 |3467  7061

Finally, using iptraf to check the packet sizes in the VM: almost all packets
are in the 1-to-70 and 71-to-140 byte ranges. That's different from the real
machine.

But maybe iptraf in the VM can't prove anything, so I also checked the real machine
that the VM is located on.
Nothing seems abnormal.

BTW, my VM is located on the ceph storage node. 

Can anyone give me more suggestions?

Thanks! 


hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around the same performance with qemu-librbd vs qemu-krbd
when I compile qemu with jemalloc
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363).

In my test, librbd with jemalloc still uses 2x more CPU than krbd,
so CPU could be the bottleneck too.
With fast CPUs (3.1GHz), I'm able to reach around 70k 4k IOPS with an rbd volume,
both with krbd and librbd.
- Original message -
From: hzwuli...@gmail.com
To: "ceph-users"
Sent: Tuesday, 20 October 2015 10:22:33
Subject: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi, 
I have a question about the IOPS performance of a real machine versus a virtual
machine.
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public
& cluster networks)
2. volume1: use rbd to create a 100G volume from the ssd pool and map it to the
real machine
3. volume2: use cinder to create a 100G volume from the ssd pool and attach it
to a guest host
4. disable rbd cache
5. fio test on the two volumes:
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 
volume1 got about 24k IOPS and volume2 got about 14k IOPS.
We can see the performance of volume2 is not good compared to volume1, so is
this normal behavior for a guest host?
If not, what might be the problem?
Thanks! 
hzwuli...@gmail.com 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-20 Thread Lindsay Mathieson
On 21 October 2015 at 16:01, Alexandre DERUMIER  wrote:

> If it's a test server, maybe you could test it with the Proxmox 4.0 hypervisor
> https://www.proxmox.com
>
> I have made a lot of patches inside it to optimize rbd (qemu+jemalloc,
> iothreads, ...)
>

Really gotta find time to upgrade my cluster ...


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com