Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-09 Thread Xiaoxi Chen
Yeah, I checked the dump; it truly is the known issue.

Thanks

2017-03-08 17:58 GMT+08:00 John Spray :
> On Tue, Mar 7, 2017 at 3:05 PM, Xiaoxi Chen  wrote:
>> Thanks John.
>>
>> Very likely; note that mds_mem::ino + mds_cache::strays_created ~=
>> mds::inodes. Also, the MDS was the standby-replay one and became
>> active days ago due to a failover.
>>
>> mds": {
>> "inodes": 1291393,
>> }
>> "mds_cache": {
>> "num_strays": 3559,
>> "strays_created": 706120,
>> "strays_purged": 702561
>> }
>> "mds_mem": {
>> "ino": 584974,
>> }
>>
>> I do have a cache dump from the MDS via the admin socket; is there
>> anything I can check in it to make 100% sure?
>
> You could go through that dump and look for the dentries with no inode
> number set, but honestly if this is a previously-standby-replay daemon
> and you're running pre-Kraken code I'd be pretty sure it's the known
> issue.
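>
> For example, a rough count could look like this (a sketch; it assumes your
> version's admin socket accepts a file argument for "dump cache" and that null
> dentries show up with "NULL" in the dumped dentry lines):
>
> ceph daemon mds.<name> dump cache /tmp/mdscache
> grep -c NULL /tmp/mdscache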
>
> John
>
>>
>> Xiaoxi
>>
>> 2017-03-07 22:20 GMT+08:00 John Spray :
>>> On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen  wrote:
 Hi,

   From the admin socket of the MDS, I got the following data on our
 production CephFS env; roughly, we have 585K inodes and almost the same
 amount of caps, but we have >2x more dentries than inodes.

   I am pretty sure we don't use hard links intensively (if at all).
 And the #ino matches "rados ls --pool $my_data_pool".

   Thanks for any explanations, appreciate it.


 "mds_mem": {
 "ino": 584974,
 "ino+": 1290944,
 "ino-": 705970,
 "dir": 25750,
 "dir+": 25750,
 "dir-": 0,
 "dn": 1291393,
 "dn+": 1997517,
 "dn-": 706124,
 "cap": 584560,
 "cap+": 2657008,
 "cap-": 2072448,
 "rss": 24599976,
 "heap": 166284,
 "malloc": 18446744073708721289,
 "buf": 0
 },

>>>
>>> One possibility is that you have many "null" dentries, which are
>>> created when we do a lookup and a file is not found -- we create a
>>> special dentry to remember that that filename does not exist, so that
>>> we can return ENOENT quickly next time.  On pre-Kraken versions, null
>>> dentries can also be left behind after file deletions when the
>>> deletion is replayed on a standby-replay MDS
>>> (http://tracker.ceph.com/issues/16919)
>>>
>>> John
>>>
>>>
>>>

 Xiaoxi
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW listing users' quota and usage painfully slow

2017-03-09 Thread Matthew Vernon

Hi,

I'm using Jewel / 10.2.3-0ubuntu0.16.04.2 . We want to keep track of our 
S3 users' quota and usage. Even with a relatively small number of users 
(23) it's taking ~23 seconds.


What we do is (in outline):
radosgw-admin metadata list user
for each user X:
  radosgw-admin user info --uid=X  #has quota details
  radosgw-admin user stats --uid=X #has usage details
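
Spelled out as a script, the loop and its timing look roughly like this (a
sketch; it assumes jq is available and that "metadata list user" returns a
JSON array of user IDs):

time (
  radosgw-admin metadata list user | jq -r '.[]' | while read -r u; do
      radosgw-admin user info  --uid="$u" > /dev/null   # quota details
      radosgw-admin user stats --uid="$u" > /dev/null   # usage details
  done
)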

None of these calls is particularly slow (~0.5s), but the net result is 
not very satisfactory.


What am I doing wrong? :)

Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW listing users' quota and usage painfully slow

2017-03-09 Thread Abhishek Lekshmanan



On 03/09/2017 11:26 AM, Matthew Vernon wrote:

Hi,

I'm using Jewel / 10.2.3-0ubuntu0.16.04.2 . We want to keep track of our
S3 users' quota and usage. Even with a relatively small number of users
(23) it's taking ~23 seconds.

What we do is (in outline):
radosgw-admin metadata list user
for each user X:
  radosgw-admin user info --uid=X  #has quota details
  radosgw-admin user stats --uid=X #has usage details

None of these calls is particularly slow (~0.5s), but the net result is
not very satisfactory.

What am I doing wrong? :)


Is this a single site or a multisite cluster? If you're only trying to 
read info you could try disabling the cache (it is not recommended to 
use this if you're trying to write/modify info), e.g.:


$ radosgw-admin user info --uid=x --rgw-cache-enabled=false

Also, you could run the info command with higher debug levels (--debug-rgw=20 
--debug-ms=1) and paste that somewhere (it's very verbose) to help 
identify where we're slowing down.
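
For example (a sketch; it assumes the debug output ends up on stderr, and the
log file name is arbitrary):

radosgw-admin user info --uid=x --debug-rgw=20 --debug-ms=1 2> rgw-user-info.log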


Best,
Abhishek
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW listing users' quota and usage painfully slow

2017-03-09 Thread Matthew Vernon

On 09/03/17 10:45, Abhishek Lekshmanan wrote:


On 03/09/2017 11:26 AM, Matthew Vernon wrote:


I'm using Jewel / 10.2.3-0ubuntu0.16.04.2 . We want to keep track of our
S3 users' quota and usage. Even with a relatively small number of users
(23) it's taking ~23 seconds.

What we do is (in outline):
radosgw-admin metadata list user
for each user X:
  radosgw-admin user info --uid=X  #has quota details
  radosgw-admin user stats --uid=X #has usage details

None of these calls is particularly slow (~0.5s), but the net result is
not very satisfactory.

What am I doing wrong? :)


Is this a single site or a multisite cluster? If you're only trying to
read info you could try disabling the cache (it is not recommended to
use this if you're trying to write/modify info) for eg:


It's a single site.


$ radosgw-admin user info --uid=x --rgw-cache-enabled=false


That doesn't noticeably change the execution time (perhaps it improves it 
a little).



also you could run the info command with higher debug (--debug-rgw=20
--debug-ms=1) and paste that somewhere (its very verbose) to help
identify where we're slowing down


https://drive.google.com/drive/folders/0B4TV1iNptBAdMEdUaGJIa3U1QVE?usp=sharing

Should let you see the output from running this with that 
cache-disabling option (and without).


Naively, I find myself wondering if some sort of all-users flag to the 
info and stats commands, or a "tell me usage and quota with one call" 
command, would be quicker.


Thanks,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW listing users' quota and usage painfully slow

2017-03-09 Thread Matthew Vernon

On 09/03/17 11:28, Matthew Vernon wrote:


https://drive.google.com/drive/folders/0B4TV1iNptBAdMEdUaGJIa3U1QVE?usp=sharing


[For the avoidance of doubt, I've changed the key associated with that 
S3 account :-) ]


Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Jewel] upgrade 10.2.3 => 10.2.5 KO : first OSD server freeze every two days :)

2017-03-09 Thread Vincent Godin
First of all, don't do a Ceph upgrade while your cluster is in a warning or
error state. An upgrade must be done from a clean cluster.

Don't stay with a replication size of 2. The majority of problems come from that
point: just look at the advice given by experienced users on this list. You
should set a replication size of 3 and a min_size of 2. This will prevent you from
losing data because of a double fault, which is frequent.
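
Concretely, that is, per pool (expect data movement while the third copies are
created):

ceph osd pool set <pool-name> size 3
ceph osd pool set <pool-name> min_size 2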

For your specific problem, I have no idea of the root cause. If you have
already checked your network (tuning parameters, jumbo frames enabled, etc.), your
software versions on all the components, your hardware (RAID card, system
messages, ...), maybe you should just re-install your first OSD server. I
had a big problem after an upgrade from hammer to jewel and nobody seemed to
have encountered it doing the same operation. All servers were configured
the same way but they did not have the same history. We found that the problem
came from the different versions we had installed on some OSD servers (giant
-> hammer -> jewel). OSD servers which had never run the giant version had no
problem at all. On the problematic servers (in jewel) we hit bugs that had
been fixed years ago in giant! So we had to isolate those
servers and reinstall them directly with jewel: that solved the problem.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW listing users' quota and usage painfully slow

2017-03-09 Thread Orit Wasserman
On Thu, Mar 9, 2017 at 1:28 PM, Matthew Vernon  wrote:

> On 09/03/17 10:45, Abhishek Lekshmanan wrote:
>
> On 03/09/2017 11:26 AM, Matthew Vernon wrote:
>>
>>>
>>> I'm using Jewel / 10.2.3-0ubuntu0.16.04.2 . We want to keep track of our
>>> S3 users' quota and usage. Even with a relatively small number of users
>>> (23) it's taking ~23 seconds.
>>>
>>> What we do is (in outline):
>>> radosgw-admin metadata list user
>>> for each user X:
>>>   radosgw-admin user info --uid=X  #has quota details
>>>   radosgw-admin user stats --uid=X #has usage details
>>>
>>> None of these calls is particularly slow (~0.5s), but the net result is
>>> not very satisfactory.
>>>
>>> What am I doing wrong? :)
>>>
>>
>> Is this a single site or a multisite cluster? If you're only trying to
>> read info you could try disabling the cache (it is not recommended to
>> use this if you're trying to write/modify info) for eg:
>>
>
> It's a single site.
>
> $ radosgw-admin user info --uid=x --rgw-cache-enabled=false
>>
>
> That doesn't noticably change the execution time (perhaps it improves it a
> little)
>
> also you could run the info command with higher debug (--debug-rgw=20
>> --debug-ms=1) and paste that somewhere (its very verbose) to help
>> identify where we're slowing down
>>
>
> https://drive.google.com/drive/folders/0B4TV1iNptBAdMEdUaGJI
> a3U1QVE?usp=sharing
>
>
A quick look at the logs seems to indicate that most of the time is spent
on the bootstrap; I will investigate it further.

Should let you see the output from running this with that cache-disabling
> option (and without).
>
> Naiively, I find myself wondering if some sort of all-users flag to the
> info and stats command or a "tell me usage and quota with one call" command
> would be quicker.
>

That sounds like a good idea :), can you open a tracker feature issue for this?

Regards,
Orit

>
> Thanks,
>
> Matthew
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a company
> registered in England with number 2742969, whose registered office is 215
> Euston Road, London, NW1 2BE. __
> _
>
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Jewel] upgrade 10.2.3 => 10.2.5 KO : first OSD server freeze every two days :)

2017-03-09 Thread pascal.pu...@pci-conseil.net

On 09/03/2017 at 13:03, Vincent Godin wrote:
First of all, don't do a Ceph upgrade while your cluster is in a warning 
or error state. An upgrade must be done from a clean cluster.

Of course.

So yesterday I tried this for my "unfound PG":

ceph pg 50.2dd mark_unfound_lost revert => MON crash :(
so:
ceph pg 50.2dd mark_unfound_lost delete => OK.

The cluster was HEALTH_OK => so I finally migrated everything to Jewel 
10.2.6. This night, nothing: everything worked fine (trimfs of rbd was 
disabled).

Maybe next time. It always happens after two days. (Scrubbing runs from 22h to 6h.)

Don't stay with a replication size of 2. The majority of problems come from that 
point: just look at the advice given by experienced users on this list. 
You should set a replication size of 3 and a min_size of 2. This will prevent 
you from losing data because of a double fault, which is frequent.
I already had some faulty PGs found by the scrubbing process (disk I/O 
errors) and had to remove the bad PGs myself. As I understand it, with 3 replicas, 
repair would be automatic.

OK, I will change to 3. :)
For your specific problem, I have no idea of the root cause. If you 
have already checked your network (tuning parameters, jumbo frames enabled, 
etc.), your software versions on all the components, your hardware 
(RAID card, system messages, ...), maybe you should just re-install 
your first OSD server. I had a big problem after an upgrade from 
hammer to jewel and nobody seemed to have encountered it doing the same 
operation. All servers were configured the same way but they did not have 
the same history. We found that the problem came from the different 
versions we had installed on some OSD servers (giant -> hammer -> jewel). 
OSD servers which had never run the giant version had no problem at all. 
On the problematic servers (in jewel) we hit bugs that had been 
fixed years ago in giant! So we had to isolate those servers 
and reinstall them directly with jewel: that solved the problem.


OK. I will think about it.

But all nodes are really the same => checked all nodes with rpm -Va => OK. 
Same tuning everywhere, etc... network checked, OK... It started just the day after the upgrade :)


Thanks for your advice. We will see tonight. :)

Pascal.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pgs stuck inactive

2017-03-09 Thread Laszlo Budai

Hello,

After a major network outage our ceph cluster ended up with an inactive PG:

# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 
requests are blocked > 32 sec; 1 osds have slow requests
pg 3.367 is stuck inactive for 912263.766607, current state incomplete, last 
acting [28,35,2]
pg 3.367 is stuck unclean for 912263.766688, current state incomplete, last 
acting [28,35,2]
pg 3.367 is incomplete, acting [28,35,2]
1 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.28
1 osds have slow requests

# ceph -s
cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
 health HEALTH_WARN
1 pgs incomplete
1 pgs stuck inactive
1 pgs stuck unclean
1 requests are blocked > 32 sec
 monmap e3: 3 mons at 
{tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
 osdmap e60609: 72 osds: 72 up, 72 in
  pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
490 GB used, 130 TB / 130 TB avail
4863 active+clean
   1 incomplete
  client io 0 B/s rd, 38465 B/s wr, 2 op/s

ceph pg repair doesn't change anything. What should I try to recover it?
Attached is the result of ceph pg query on the problem PG.

Thank you,
Laszlo


pg_3.367_query.gz
Description: application/gzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs and erasure coding

2017-03-09 Thread Rhian Resnick
Thanks for the confirmations of what is possible.

We plan on creating a new file system, rsync and delete the old one.

Rhian

On Mar 9, 2017 2:27 AM, Maxime Guyot  wrote:

Hi,



>“The answer as to how to move an existing cephfs pool from replication to 
>erasure coding (and vice versa) is to create the new pool and rsync your data 
>between them.”

Shouldn’t it be possible to just do the “ceph osd tier add  ecpool cachepool && 
ceph osd tier cache-mode cachepool writeback” and let Ceph redirect the 
requests (CephFS or other) to the cache pool?
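
For reference, if one went the cache-tier route, the usual sequence for fronting an EC 
pool with a replicated cache pool is roughly (a sketch with placeholder pool names; the 
cache pool still needs hit_set and target-size settings on top of this):

ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool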



Cheers,

Maxime



From: ceph-users  on behalf of David Turner 

Date: Wednesday 8 March 2017 22:27
To: Rhian Resnick , "ceph-us...@ceph.com" 

Subject: Re: [ceph-users] cephfs and erasure coding



I use CephFS on erasure coding at home using a cache tier.  It works fine for 
my use case, but we know nothing about your use case to know if it will work 
well for you.

The answer as to how to move an existing cephfs pool from replication to 
erasure coding (and vice versa) is to create the new pool and rsync your data 
between them.





David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943




If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.





From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rhian Resnick 
[rresn...@fau.edu]
Sent: Wednesday, March 08, 2017 12:54 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] cephfs and erasure coding

Two questions on Cephfs and erasure coding that Google couldn't answer.





1) How well does cephfs work with erasure coding?



2) How would you move an existing cephfs pool that uses replication to erasure 
coding?



Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology



Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does ceph preserve read/write consistency?

2017-03-09 Thread Wei Jin
On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒  wrote:
> Hi, everyone.

> As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107 and 
> READ req with tid 6476 arrived at 18:59:55.030936, however, the latter 
> finished at 19:00:20:89 while the former finished commit at 
> 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs, we 
> found that between the start and finish of each req, there was a lot of 
> "dequeue_op" of that req. We read the source code, it seems that this is due 
> to "RWState", is that correct?
>
> And also, it seems that OSD won't distinguish reqs from different clients, so 
> is it possible that io reqs from the same client also finish in a different 
> order than that they were created in? Could this affect the read/write 
> consistency? For instance, that a read can't acquire the data that were 
> written by the same client just before it.
>

IMO, that doesn't make sense for rados to distinguish reqs from
different clients.
Clients or Users should do it by themselves.

However, as for one specific client, ceph can and must guarantee the
request order.

1) ceph messenger (network layer) has in_seq and out_seq when
receiving and sending message

2) message will be dispatched or fast dispatched and then be queued in
ShardedOpWq in order.

If requests belong to different pgs, they may be processed
concurrently, that's ok.

If requests belong to the same pg, they will be queued in the same
shard and will be processed in order due to pg lock (both read and
write).
For continuous write, op will be queued in ObjectStore in order due to
pg lock and ObjectStore has OpSequence to guarantee the order when
applying op to page cache, that's ok.

With regard to  'read after write' to the same object, ceph must
guarantee read can get the correct write content. That's done by
ondisk_read/write_lock in ObjectContext.


> We are testing hammer version, 0.94.5.  Please help us, thank you:-)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Map Costs (Was: Snapshot Costs (Was: Re: Pool Sizes))

2017-03-09 Thread Kent Borg

On 03/08/2017 05:07 PM, Gregory Farnum wrote:

How about iterating through a whole set of values vs. reading a RADOS object
holding the same amount of data?

"Iterating"?


As in rados_read_op_omap_get_vals(), "Start iterating over key/value 
pairs on an object."



In general, you should use the format that is appropriate for the data
and usage pattern rather than worrying about performance — they are
optimized for the interfaces we expose! ;)


But looks can deceive. For example, your API exposes a call to find out 
how many objects are in a pool. But, experimentally, I discovered it can be 
both low and high. Once I understood (better) how it is implemented, I 
could see why I should not use that for more than an estimate.


Or, silly me, I saw an interface that exposes creation and deletion of 
pools: Don't do that! (Well, hardly ever.)


Understanding how it works under the hood makes these things much clearer.

Another example, omap values vs. xattrs: they are an odd set of 
siblings, but they make much more sense once one knows the 
implementation differences.


(I am guessing that an xattr read or write--to an XFS OSD--would be 
faster than an omap read or write. Unless the xattr overflows in size to 
become a LevelDB transaction. Right? Also, I can imagine xattrs being 
deprecated once BlueStore settles in and starts to get comfortable.)
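
A quick way to poke at both interfaces from the command line (a sketch; it 
assumes a scratch pool and object you can write to):

rados -p testpool put obj1 /etc/hosts              # plain object data
rados -p testpool setxattr obj1 myattr myvalue     # xattr
rados -p testpool getxattr obj1 myattr
rados -p testpool setomapval obj1 mykey myvalue    # omap key/value
rados -p testpool listomapvals obj1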



Ceph is like some strange predator that can swallow beasts far, far 
bigger than it. (EMC and Netapp and...?) Us folk out here, programming 
at the RADOS layer (though I am starting to think maybe there are very 
few of me), need to understand which parts are dang so stretchy and 
which ones are not. The dynamic range between what is suitable to put in 
a single xattr and petabytes of a cluster is considerable. There is a 
lot of room to scale things wrong.


But I am getting the hang of it. Slowly.


Thanks,

-kb
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bogus "inactive" errors during OSD restarts with Jewel

2017-03-09 Thread Ruben Kerkhof
On Thu, Mar 9, 2017 at 3:04 AM, Christian Balzer  wrote:
>
>
> Hello,
>
> during OSD restarts with Jewel (10.2.5 and .6 at least) I've seen
> "stuck inactive for more than 300 seconds" errors like this when observing
> things with "watch ceph -s" :
> ---
>  health HEALTH_ERR
> 59 pgs are stuck inactive for more than 300 seconds
> 223 pgs degraded
> 74 pgs peering
> 84 pgs stale
> 59 pgs stuck inactive
> 297 pgs stuck unclean
> 223 pgs undersized
> recovery 38420/179352 objects degraded (21.422%)
> 2/16 in osds are down
> ---
>
> Now this is neither reflected in any logs, nor true of course (the
> restarts take a few seconds per OSD and the cluster is fully recovered
> to HEALTH_OK in 12 seconds or so).
>
> But it surely is a good scare for somebody not doing this on a test
> cluster.
>
> Anybody else seeing this?

Definitely. ceph -w shows them as well. They indeed always clear after
a few seconds.

>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

Kind regards,

Ruben Kerkhof
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph with RDMA

2017-03-09 Thread PR PR
Hi,

I am trying to use ceph with RDMA. I have a few questions.

1. Is there a prebuilt package that has RDMA support, or is the only way to try
ceph+rdma to check it out from GitHub and compile from scratch?

2. Looks like there are two ways of using rdma - xio and async+rdma. Which
is the recommended approach? Also, any insights on the differences will be
useful as well.

3. async+rdma seems to have had a lot of recent changes. Is 11.2.0 expected to
work for async+rdma? When I compiled 11.2.0 it failed with the following error:

[ 81%] Built target rbd
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_free_device_list'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_get_cq_event'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_alloc_pd'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_close_device'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_destroy_qp'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_modify_qp'
/mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
`ibv_get_async_event'
***snipped***
Link Error: Ceph FS library not found
src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/build.make:57: recipe for
target 'src/pybind/cephfs/CMakeFiles/cython_cephfs' failed
make[2]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs] Error 1
CMakeFiles/Makefile2:4015: recipe for target
'src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all' failed
make[1]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs
[ 85%] Built target rgw_a

Thanks,
PR
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs and erasure coding

2017-03-09 Thread Rhian Resnick
Thanks everyone for the input. We are online in our test environment and are 
running user workflows to make sure everything is running as expected.



Rhian

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rhian 
Resnick
Sent: Thursday, March 9, 2017 8:31 AM
To: Maxime Guyot 
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] cephfs and erasure coding

Thanks for the confirmations of what is possible.

We plan on creating a new file system, rsync and delete the old one.

Rhian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shrinking lab cluster to free hardware for a new deployment

2017-03-09 Thread Ben Hines
AFAIK, depending on how many you have, you are likely to end up with a 'too
many PGs per OSD' warning for your main pool if you do this, because the
number of PGs in a pool cannot be reduced and there will be fewer OSDs to
put them on.
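
As a worked example (hypothetical numbers; assumes a single 512-PG pool at size 2
and the default warning threshold of 300 PGs per OSD):

echo $(( 512 * 2 / 6 ))   # ~170 PG copies per OSD across 6 OSDs
echo $(( 512 * 2 / 2 ))   # 512 PG copies per OSD across 2 OSDs, well over the warning threshold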

-Ben

On Wed, Mar 8, 2017 at 5:53 AM, Henrik Korkuc  wrote:

> On 17-03-08 15:39, Kevin Olbrich wrote:
>
> Hi!
>
> Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each).
> We want to shut down the cluster but it holds some semi-productive VMs we
> might or might not need in the future.
> To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we use
> size 2 and min_size 1).
>
> Should I set the OSDs out one by one or with norefill, norecovery flags
> set but all at once?
> If last is the case, which flags should be set also?
>
> just set the OSDs out and wait for them to rebalance; the OSDs will be active and
> serve traffic while data is moving off them. I had a case where some
> pgs wouldn't move out, so after everything settles, you may need to remove
> OSDs from crush one by one.
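>
> Roughly, one OSD at a time (a sketch):
>
> ceph osd out 5                # mark one OSD out and let data drain off it
> ceph -s                       # wait until the cluster is healthy again before the next one
> ceph osd crush remove osd.5   # only if some PGs refuse to move off the "out" OSD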
>
> Thanks!
>
> Kind regards,
> Kevin Olbrich.
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Map Costs (Was: Snapshot Costs (Was: Re: Pool Sizes))

2017-03-09 Thread Max Yehorov
re: python library

you can do some mon calls using this:

##--
import rados
from ceph_argparse import json_command

# connect() returns None, so keep and pass the Rados handle itself
rados_inst = rados.Rados(conffile='/etc/ceph/ceph.conf')
rados_inst.connect()

cmd = {'prefix': 'pg dump', 'dumpcontents': ['summary', ], 'format': 'json'}
retcode, jsonret, errstr = json_command(rados_inst, argdict=cmd)
##--


MON commands
https://github.com/ceph/ceph/blob/a68106934c5ed28d0195d6104bce5981aca9aa9d/src/mon/MonCommands.h

On Wed, Mar 8, 2017 at 2:01 PM, Kent Borg  wrote:
> I'm slowly working my way through Ceph's features...
>
> I recently happened upon object maps. (I had heard of LevelDB being in there
> but never saw how to use it: That's because I have been using Python! And
> the Python library is missing lots of features! Grrr.)
>
> How fast are those omap calls?
>
> Which is faster: a single LevelDB query yielding a few bytes vs. a single
> RADOS object read of that many bytes at a specific offset?
>
> How about iterating through a whole set of values vs. reading a RADOS object
> holding the same amount of data?
>
> Thanks,
>
> -kb, the Kent who is guessing LevelDB will be slower in both cases, because
> he really isn't using the key/value aspect of LevelDB but is still paying
> for it.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-09 Thread Max Yehorov
The "list watchers" command was definitely missing.

On Wed, Mar 8, 2017 at 4:16 PM, Josh Durgin  wrote:
> On 03/08/2017 02:15 PM, Kent Borg wrote:
>>
>> On 03/08/2017 05:08 PM, John Spray wrote:
>>>
>>> Specifically?
>>> I'm not saying you're wrong, but I am curious which bits in particular
>>> you missed.
>>>
>>
>> Object maps. Those transaction-y things. Object classes. Maybe more I
>> don't know about because I have been learning via Python.
>
>
> There are certainly gaps in the python bindings, but those are all
> covered since jewel.
>
> Hmm, you may have been confused by the docs website - I'd thought the
> reference section was autogenerated from the docstrings, like it is for
> librbd, but it's just static text: http://tracker.ceph.com/issues/19238
>
> For reference, take a look at 'help(rados)' from the python
> interpreter, or check out the source and tests:
>
> https://github.com/ceph/ceph/blob/jewel/src/pybind/rados/rados.pyx
> https://github.com/ceph/ceph/blob/jewel/src/test/pybind/test_rados.py
>
> Josh
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-09 Thread Kent Borg

On 03/09/2017 06:19 PM, Max Yehorov wrote:

There was definitely missing the "list watchers" command.


I was hearing something about somebody doing locking partly via list 
watchers... but I hadn't looked for it.


The Python librados I have been playing with was part of Debian. Today I 
built it from git sources and I do have a much more complete version. 
More complete underlying RADOS library, too.


Thanks for that pointer, Josh. Yes, my obsolete library matched the 
obsolete docs pretty well.


-kb

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-09 Thread Brad Hubbard
Can you explain more about what happened?

The query shows progress is blocked by the following OSDs.

"blocked_by": [
14,
17,
51,
58,
63,
64,
68,
70
],

Some of these OSDs are marked as "dne" (Does Not Exist).

peer": "17",
"dne": 1,
"peer": "51",
"dne": 1,
"peer": "58",
"dne": 1,
"peer": "64",
"dne": 1,
"peer": "70",
"dne": 1,

Can we get a complete background here please?


On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai  wrote:
> Hello,
>
> After a major network outage our ceph cluster ended up with an inactive PG:
>
> # ceph health detail
> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1
> requests are blocked > 32 sec; 1 osds have slow requests
> pg 3.367 is stuck inactive for 912263.766607, current state incomplete, last
> acting [28,35,2]
> pg 3.367 is stuck unclean for 912263.766688, current state incomplete, last
> acting [28,35,2]
> pg 3.367 is incomplete, acting [28,35,2]
> 1 ops are blocked > 268435 sec
> 1 ops are blocked > 268435 sec on osd.28
> 1 osds have slow requests
>
> # ceph -s
> cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
>  health HEALTH_WARN
> 1 pgs incomplete
> 1 pgs stuck inactive
> 1 pgs stuck unclean
> 1 requests are blocked > 32 sec
>  monmap e3: 3 mons at
> {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
> election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
>  osdmap e60609: 72 osds: 72 up, 72 in
>   pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
> 490 GB used, 130 TB / 130 TB avail
> 4863 active+clean
>1 incomplete
>   client io 0 B/s rd, 38465 B/s wr, 2 op/s
>
> ceph pg repair doesn't change anything. What should I try to recover it?
> Attached is the result of ceph pg query on the problem PG.
>
> Thank you,
> Laszlo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with RDMA

2017-03-09 Thread Haomai Wang
On Fri, Mar 10, 2017 at 4:28 AM, PR PR  wrote:
> Hi,
>
> I am trying to use ceph with RDMA. I have a few questions.
>
> 1. Is there a prebuilt package that has rdma support or the only way to try
> ceph+rdma is to checkout from github and compile from scratch?
>
> 2. Looks like there are two ways of using rdma - xio and async+rdma. Which
> is the recommended approach? Also, any insights on the differences will be
> useful as well.
>
> 3. async+rdma seems to have lot of recent changes. Is 11.2.0 expected to
> work for async+rdma? As when I compiled 11.2.0 it fails with following error
>

I suggest checking out master.
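
Once built with RDMA support, the async+rdma messenger is selected in ceph.conf,
roughly like this (a sketch based on the async+rdma work; the values, especially
the device name, are assumptions to adapt to your hardware):

[global]
 ms_type = async+rdma
 ms_async_rdma_device_name = mlx5_0   # assumption: use your own RDMA device name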

> [ 81%] Built target rbd
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_free_device_list'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_get_cq_event'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_alloc_pd'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_close_device'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_destroy_qp'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_modify_qp'
> /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> `ibv_get_async_event'
> ***snipped***
> Link Error: Ceph FS library not found
> src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/build.make:57: recipe for
> target 'src/pybind/cephfs/CMakeFiles/cython_cephfs' failed
> make[2]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs] Error 1
> CMakeFiles/Makefile2:4015: recipe for target
> 'src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all' failed
> make[1]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all] Error 2
> make[1]: *** Waiting for unfinished jobs
> [ 85%] Built target rgw_a
>
> Thanks,
> PR
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: How does ceph preserve read/write consistency?

2017-03-09 Thread 许雪寒
Thanks for your reply.

As the log shows, in our test, a READ that came after a WRITE did finish 
before that WRITE. And having read the source code, it seems that, for writes, in 
the ReplicatedPG::do_op method, the thread in OSD_op_tp calls the 
ReplicatedPG::get_rw_lock method, which tries to get RWState::RWWRITE. If it 
fails, the op will be put into the obc->rwstate.waiters queue and be requeued when 
the repop finishes; however, the OSD_op_tp thread doesn't wait for the repop and 
tries to get the next op. Can this be the cause?

-----Original Message-----
From: Wei Jin [mailto:wjin...@gmail.com] 
Sent: March 9, 2017 21:52
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How does ceph preserve read/write consistency?

On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒  wrote:
> Hi, everyone.

> As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107 and 
> READ req with tid 6476 arrived at 18:59:55.030936, however, the latter 
> finished at 19:00:20:89 while the former finished commit at 
> 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs, we 
> found that between the start and finish of each req, there was a lot of 
> "dequeue_op" of that req. We read the source code, it seems that this is due 
> to "RWState", is that correct?
>
> And also, it seems that OSD won't distinguish reqs from different clients, so 
> is it possible that io reqs from the same client also finish in a different 
> order than that they were created in? Could this affect the read/write 
> consistency? For instance, that a read can't acquire the data that were 
> written by the same client just before it.
>

IMO, that doesn't make sense for rados to distinguish reqs from different 
clients.
Clients or Users should do it by themselves.

However, as for one specific client, ceph can and must guarantee the request 
order.

1) ceph messenger (network layer) has in_seq and out_seq when receiving and 
sending message

2) message will be dispatched or fast dispatched and then be queued in 
ShardedOpWq in order.

If requests belong to different pgs, they may be processed concurrently, that's 
ok.

If requests belong to the same pg, they will be queued in the same shard and 
will be processed in order due to pg lock (both read and write).
For continuous write, op will be queued in ObjectStore in order due to pg lock 
and ObjectStore has OpSequence to guarantee the order when applying op to page 
cache, that's ok.

With regard to  'read after write' to the same object, ceph must guarantee read 
can get the correct write content. That's done by ondisk_read/write_lock in 
ObjectContext.


> We are testing hammer version, 0.94.5.  Please help us, thank you:-) 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: How does ceph preserve read/write consistency?

2017-03-09 Thread 许雪寒
I also submitted an issue: http://tracker.ceph.com/issues/19252

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 许雪寒
Sent: March 10, 2017 11:20
To: Wei Jin; ceph-users@lists.ceph.com
Subject: [ceph-users] Re: How does ceph preserve read/write consistency?

Thanks for your reply.

As the log shows, in our test, a READ that came after a WRITE did finish 
before that WRITE. And having read the source code, it seems that, for writes, in 
the ReplicatedPG::do_op method, the thread in OSD_op_tp calls the 
ReplicatedPG::get_rw_lock method, which tries to get RWState::RWWRITE. If it 
fails, the op will be put into the obc->rwstate.waiters queue and be requeued when 
the repop finishes; however, the OSD_op_tp thread doesn't wait for the repop and 
tries to get the next op. Can this be the cause?

-----Original Message-----
From: Wei Jin [mailto:wjin...@gmail.com]
Sent: March 9, 2017 21:52
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How does ceph preserve read/write consistency?

On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒  wrote:
> Hi, everyone.

> As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107 and 
> READ req with tid 6476 arrived at 18:59:55.030936, however, the latter 
> finished at 19:00:20:89 while the former finished commit at 
> 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs, we 
> found that between the start and finish of each req, there was a lot of 
> "dequeue_op" of that req. We read the source code, it seems that this is due 
> to "RWState", is that correct?
>
> And also, it seems that OSD won't distinguish reqs from different clients, so 
> is it possible that io reqs from the same client also finish in a different 
> order than that they were created in? Could this affect the read/write 
> consistency? For instance, that a read can't acquire the data that were 
> written by the same client just before it.
>

IMO, that doesn't make sense for rados to distinguish reqs from different 
clients.
Clients or Users should do it by themselves.

However, as for one specific client, ceph can and must guarantee the request 
order.

1) ceph messenger (network layer) has in_seq and out_seq when receiving and 
sending message

2) message will be dispatched or fast dispatched and then be queued in 
ShardedOpWq in order.

If requests belong to different pgs, they may be processed concurrently, that's 
ok.

If requests belong to the same pg, they will be queued in the same shard and 
will be processed in order due to pg lock (both read and write).
For continuous write, op will be queued in ObjectStore in order due to pg lock 
and ObjectStore has OpSequence to guarantee the order when applying op to page 
cache, that's ok.

With regard to  'read after write' to the same object, ceph must guarantee read 
can get the correct write content. That's done by ondisk_read/write_lock in 
ObjectContext.


> We are testing hammer version, 0.94.5.  Please help us, thank you:-) 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-09 Thread Xavier Trilla
Hi,

We compiled Hammer .10 to use jemalloc and now the cluster performance has improved 
a lot, but POSIX AIO operations are still quite a bit slower than libaio.

Now, with a single thread, read operations are about 1000 per second and write 
operations about 5000 per second.

Using the same FIO configuration but with libaio, read operations are about 15K per 
second and writes 12K per second.

I'm compiling QEMU with jemalloc support as well, and I'm planning to replace 
librbd on the QEMU hosts with the new build using jemalloc.

But it still looks like there is some bottleneck in QEMU or librbd that I cannot 
manage to find.

Any help will be much appreciated.

Thanks.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Xavier 
Trilla
Sent: Thursday, March 9, 2017 6:56
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Posix AIO vs libaio read performance

Hi,

I'm trying to debug why there is a big difference between using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd.

The results I'm getting using FIO are:

POSIX AIO Read:

Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 2.54 MB/s
Average: 632 IOPS

Libaio Read:

Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 147.88 MB/s
Average: 36967 IOPS

When performing writes the differences aren't so big, because the cluster 
-which is in production right now- is CPU bound:

POSIX AIO Write:

Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 14.87 MB/s
Average: 3713 IOPS

Libaio Write:

Type: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 14.51 MB/s
Average: 3622 IOPS


Even if the write results are CPU bound, as the machines containing the OSDs 
don't have enough CPU to handle all the IOPS (CPU upgrades are on their way), I 
cannot really understand why I'm seeing so much difference in the read tests.

Some configuration background:

- Cluster and clients are using Hammer 0.94.90
- It's a full SSD cluster running over Samsung Enterprise SATA SSDs, with all 
the typical tweaks (Customized ceph.conf, optimized sysctl, etc...)
- Tried QEMU 2.0 and 2.7 - Similar results
- Tried virtio-blk and virtio-scsi - Similar results

I've been reading about POSIX AIO and libaio, and I can see there are several 
differences in how they work (like one being user space and the other one being 
kernel space), but I don't really get why Ceph has such problems handling POSIX AIO 
read operations, but not write operations, and how to avoid them.

Right now I'm trying to identify if it's something wrong with our Ceph cluster 
setup, with Ceph in general, or with QEMU (virtio-scsi or virtio-blk, as both 
show the same behavior).

If you would like to try to reproduce the issue here are the two command lines 
I'm using:

fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
fio --name=randread-libaio --output ./test --runtime 60 --ioengine=libaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32


If you could shed any light on this it would be really helpful; right now, 
although I still have some ideas left to try, I don't have much of an idea about 
why this is happening...

Thanks!
Xavier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-09 Thread Alexandre DERUMIER

>>But it still looks like there is some bottleneck in QEMU or librbd that I cannot 
>>manage to find.

You can improve latency on the client by disabling debug logging.

on your client, create a /etc/ceph/ceph.conf with

[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0


You can also disable the RBD cache (rbd_cache = false) or, in QEMU, set cache=none.

Using an iothread on the QEMU drive should help a little bit too.
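
On the QEMU command line, cache=none plus a dedicated iothread looks roughly like
this (a sketch; the drive/iothread IDs and the rbd image name are placeholders):

-object iothread,id=iothread0 \
-drive file=rbd:rbd/vm-disk-1,format=raw,if=none,id=drive0,cache=none \
-device virtio-blk-pci,drive=drive0,iothread=iothread0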

- Original Message -
From: "Xavier Trilla" 
To: "ceph-users" 
Sent: Friday, 10 March 2017 05:37:01
Subject: Re: [ceph-users] Posix AIO vs libaio read performance



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___