[ceph-users] radosgw hanging - blocking "rgw.bucket_list" ops

2015-08-21 Thread Sam Wouters
Hi,

We are running hammer 0.94.2 and have an increasing amount of
"heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
timed out after 600" messages in our radosgw logs, with radosgw
eventually stalling. A restart of the radosgw helps for a few minutes,
but after that it hangs again.

"ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests" shows
"call rgw.bucket_list" ops. No new bucket lists are requested, so those
ops seem to stay there. Anyone any idea how to get rid of those. Restart
of the affecting osd didn't help neither.

I'm not sure if it's related, but we have an object called "_sanity" in
the bucket on which the listing was performed. I know there is some bug
with objects starting with "_".

Any help would be much appreciated.

r,
Sam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw hanging - blocking "rgw.bucket_list" ops

2015-08-21 Thread Sam Wouters
I suspect these to be the cause:

rados ls -p .be-east.rgw.buckets | grep sanity
be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
be-east.5436.1__sanity
be-east.5436.1__:2vBijaGnVQF4Q0IjZPeyZSKeUmBGn9X__sanity   
be-east.5436.1__sanity
be-east.5436.1__:4JTCVFxB1qoDWPu1nhuMDuZ3QNPaq5n__sanity   
be-east.5436.1__sanity
be-east.5436.1__:9jFwd8xvqJMdrqZuM8Au4mi9M62ikyo__sanity   
be-east.5436.1__sanity
be-east.5436.1__:BlfbGYGvLi92QPSiabT2mP7OeuETz0P__sanity   
be-east.5436.1__sanity
be-east.5436.1__:MigpcpJKkan7Po6vBsQsSD.hEIRWuim__sanity   
be-east.5436.1__sanity
be-east.5436.1__:QDTxD5p0AmVlPW4v8OPU3vtDLzenj4y__sanity   
be-east.5436.1__sanity
be-east.5436.1__:S43EiNAk5hOkzgfbOynbOZOuLtUv0SB__sanity   
be-east.5436.1__sanity
be-east.5436.1__:UKlOVMQBQnlK20BHJPyvnG6m.2ogBRW__sanity   
be-east.5436.1__sanity
be-east.5436.1__:kkb6muzJgREie6XftdEJdFHxR2MaFeB__sanity   
be-east.5436.1__sanity
be-east.5436.1__:oqPhWzFDSQ-sNPtppsl1tPjoryaHNZY__sanity   
be-east.5436.1__sanity
be-east.5436.1__:pLhygPGKf3uw7C7OxSJNCw8rQEMOw5l__sanity   
be-east.5436.1__sanity
be-east.5436.1__:tO1Nf3S2WOfmcnKVPv0tMeXbwa5JR36__sanity   
be-east.5436.1__sanity
be-east.5436.1__:ye4oRwDDh1cGckbMbIo56nQvM7OEyPM__sanity   
be-east.5436.1__sanity
be-east.5436.1___sanitybe-east.5436.1__sanity

Would it be safe and/or helpful to remove those with "rados rm", and then
try a bucket check --fix --check-objects?

On 21-08-15 11:28, Sam Wouters wrote:
> Hi,
>
> We are running hammer 0.94.2 and have an increasing amount of
> "heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
> timed out after 600" messages in our radosgw logs, with radosgw
> eventually stalling. A restart of the radosgw helps for a few minutes,
> but after that it hangs again.
>
> "ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests" shows
> "call rgw.bucket_list" ops. No new bucket lists are requested, so those
> ops seem to stay there. Anyone any idea how to get rid of those. Restart
> of the affecting osd didn't help neither.
>
> I'm not sure if it's related, but we have an object called "_sanity" in
> the bucket on which the listing was performed. I know there is some bug
> with objects starting with "_".
>
> Any help would be much appreciated.
>
> r,
> Sam

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Testing CephFS

2015-08-21 Thread Gregory Farnum
On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam  wrote:
> Hey all,
>
>
>
> We are currently testing CephFS on a small (3 node) cluster.
>
>
>
> The setup is currently:
>
>
>
> Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:
>
> The servers are running: 0.94.2-0.el7
>
> The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64
>
>
>
> ceph -s
>
> cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
>
>  health HEALTH_OK
>
>  monmap e1: 3 mons at
> {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
>
> election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3
>
>  mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby
>
>  osdmap e389: 36 osds: 36 up, 36 in
>
>   pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects
>
> 95526 MB used, 196 TB / 196 TB avail
>
> 8256 active+clean
>
>
>
> Our Ceph.conf is relatively simple at the moment:
>
>
>
> cat /etc/ceph/ceph.conf
>
> [global]
>
> fsid = 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
>
> mon_initial_members = ceph1, ceph2, ceph3
>
> mon_host = 10.15.0.1,10.15.0.2,10.15.0.3
>
> mon_pg_warn_max_per_osd = 1000
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
> filestore_xattr_use_omap = true
>
> osd_pool_default_size = 2
>
>
>
> When I pulled the plug on the master MDS last time (ceph1), it stopped all
> IO until I plugged it back in. I was under the assumption that the MDS would
> fail over to the other 2 MDSs and IO would continue?
>
>
>
> Is there something I need to do to allow the MDSs to fail over to each
> other without too much interruption? Or is this because of the clients'
> ceph version?

That's quite strange. How long did you wait for it to fail over? Did
the output of "ceph -s" (or "ceph -w", whichever) change during that
time?
By default the monitors should have detected the MDS was dead after 30
seconds and put one of the other MDS nodes into replay and active.

...I wonder if this is because you lost a monitor at the same time as
the MDS. What kind of logging do you have available from during your
test?
-Greg

>
>
>
> Cheers,
>
>
>
> Simon Hallam
>
> Linux Support & Development Officer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw hanging - blocking "rgw.bucket_list" ops

2015-08-21 Thread Sam Wouters
tried removing, but no luck:

rados -p .be-east.rgw.buckets rm
"be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity"
error removing
.be-east.rgw.buckets>be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity:
(2)

anyone?

On 21-08-15 13:06, Sam Wouters wrote:
> I suspect these to be the cause:
>
> rados ls -p .be-east.rgw.buckets | grep sanity
> be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
> be-east.5436.1__sanity
> be-east.5436.1__:2vBijaGnVQF4Q0IjZPeyZSKeUmBGn9X__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:4JTCVFxB1qoDWPu1nhuMDuZ3QNPaq5n__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:9jFwd8xvqJMdrqZuM8Au4mi9M62ikyo__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:BlfbGYGvLi92QPSiabT2mP7OeuETz0P__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:MigpcpJKkan7Po6vBsQsSD.hEIRWuim__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:QDTxD5p0AmVlPW4v8OPU3vtDLzenj4y__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:S43EiNAk5hOkzgfbOynbOZOuLtUv0SB__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:UKlOVMQBQnlK20BHJPyvnG6m.2ogBRW__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:kkb6muzJgREie6XftdEJdFHxR2MaFeB__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:oqPhWzFDSQ-sNPtppsl1tPjoryaHNZY__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:pLhygPGKf3uw7C7OxSJNCw8rQEMOw5l__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:tO1Nf3S2WOfmcnKVPv0tMeXbwa5JR36__sanity   
> be-east.5436.1__sanity
> be-east.5436.1__:ye4oRwDDh1cGckbMbIo56nQvM7OEyPM__sanity   
> be-east.5436.1__sanity
> be-east.5436.1___sanitybe-east.5436.1__sanity
>
> Would it be safe and/or helpful to remove those with "rados rm", and then
> try a bucket check --fix --check-objects?
>
> On 21-08-15 11:28, Sam Wouters wrote:
>> Hi,
>>
>> We are running hammer 0.94.2 and have an increasing amount of
>> "heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
>> timed out after 600" messages in our radosgw logs, with radosgw
>> eventually stalling. A restart of the radosgw helps for a few minutes,
>> but after that it hangs again.
>>
>> "ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests" shows
>> "call rgw.bucket_list" ops. No new bucket lists are requested, so those
>> ops seem to stay there. Anyone any idea how to get rid of those. Restart
>> of the affecting osd didn't help neither.
>>
>> I'm not sure if it's related, but we have an object called "_sanity" in
>> the bucket on which the listing was performed. I know there is some bug
>> with objects starting with "_".
>>
>> Any help would be much appreciated.
>>
>> r,
>> Sam

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread J-P Methot
Hi,

First of all, we are sure that the return to the default configuration
fixed it. As soon as we restarted only one of the ceph nodes with the
default configuration, it sped up recovery tremendously. We had already
restarted before with the old conf and recovery was never that fast.

Regarding the configuration, here's the old one with comments :

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = ***
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true   //
  Lets you use extended attributes (xattrs) of xfs/ext4/btrfs filesystems
osd_pool_default_pgp_num = 450   //
default pgp number for new pools
osd_pg_bits = 12  //
 number of bits used to designate pgps. Lets you have 2^12 pgps
osd_pool_default_size = 3   //
 default copy number for new pools
osd_pool_default_pg_num = 450//
default pg number for new pools
public_network = *
cluster_network = ***
osd_pgp_bits = 12   //
 number of bits used to designate pgps. Lets you have 2^12 pgps

[osd]
filestore_queue_max_ops = 5000// set to 500 by default Defines the
maximum number of in progress operations the file store accepts before
blocking on queuing new operations.
filestore_fd_cache_random = true//  
journal_queue_max_ops = 100   //   set
to 500 by default. Number of operations allowed in the journal queue
filestore_omap_header_cache_size = 100  //   Determines
the size of the LRU used to cache object omap headers. Larger values use
more memory but may reduce lookups on omap.
filestore_fd_cache_size = 100 //
not in the ceph documentation. Seems to be a common tweak for SSD
clusters though.
max_open_files = 100 //
  lets ceph set the max file descriptor in the OS to prevent running out
of file descriptors
osd_journal_size = 1   //
journal max size for each OSD

New conf:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = **
cluster_network = **

You might notice, I have a few undocumented settings in the old
configuration. These are settings I took from a certain openstack summit
presentation and they may have contributed to this whole problem. Here's
a list of settings that I think might be a possible cause for these
speed issues:

filestore_fd_cache_random = true
filestore_fd_cache_size = 100

Additionally, my colleague thinks these settings may have contributed :

filestore_queue_max_ops = 5000
journal_queue_max_ops = 100

We will do further tests on these settings once we have our lab ceph
test environment as we are also curious as to exactly what caused this.


On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
>>
>> Just to update the mailing list, we ended up going back to default
>> ceph.conf without any additional settings than what is mandatory. We are
>> now reaching speeds we never reached before, both in recovery and in
>> regular usage. There was definitely something we set in the ceph.conf
>> bogging everything down.
> 
> Could you please share the old and new ceph.conf, or the section that
> was removed?
> 
> Best regards,
> Alex
> 
>>
>>
>> On 2015-08-20 4:06 AM, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> from all the pertinent points by Somnath, the one about pre-conditioning
>>> would be pretty high on my list, especially if this slowness persists and
>>> nothing else (scrub) is going on.
>>>
>>> This might be "fixed" by doing a fstrim.
>>>
>>> Additionally the levelDB's per OSD are of course sync'ing heavily during
>>> reconstruction, so that might not be the favorite thing for your type of
>>> SSDs.
>>>
>>> But ultimately situational awareness is very important, as in "what" is
>>> actually going and slowing things down.
>>> As usual my recommendations would be to use atop, iostat or similar on all
>>> your nodes and see if your OSD SSDs are indeed the bottleneck or if it is
>>> maybe just one of them or something else entirely.
>>>
>>> Christian
>>>
>>> On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote:
>>>
 Also, check if scrubbing started in the cluster or not. That may
 considerably slow down the cluster.

 -Original Message-
 From: Somnath Roy
 Sent: Wednesday, August 19, 2015 1:35 PM
 To: 'J-P Methot'; ceph-us...@ceph.com
 Subject: RE: [ceph-users] Bad performances in recovery

 All the writes will go through the journal.
 It may happen your SSDs are not precon

Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread Jan Schermer
Thanks for the config,
few comments inline, not really related to the issue

> On 21 Aug 2015, at 15:12, J-P Methot  wrote:
> 
> Hi,
> 
> First of all, we are sure that the return to the default configuration
> fixed it. As soon as we restarted only one of the ceph nodes with the
> default configuration, it sped up recovery tremendously. We had already
> restarted before with the old conf and recovery was never that fast.
> 
> Regarding the configuration, here's the old one with comments :
> 
> [global]
> fsid = *
> mon_initial_members = cephmon1
> mon_host = ***
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true   //
>  Lets you use extended attributes (xattrs) of xfs/ext4/btrfs filesystems

This actually did the opposite, but this option doesn't exist anymore

> osd_pool_default_pgp_num = 450   //
> default pgp number for new pools
> osd_pg_bits = 12  //
> number of bits used to designate pgps. Lets you have 2^12 pgps

Could someone comment on those? What exactly does it do? What if I have more 
PGs than num_osds*osd_pg_bits?

> osd_pool_default_size = 3   //
> default copy number for new pools
> osd_pool_default_pg_num = 450//
> default pg number for new pools
> public_network = *
> cluster_network = ***
> osd_pgp_bits = 12   //
> number of bits used to designate pgps. Lets you have 2^12 pgps
> 
> [osd]
> filestore_queue_max_ops = 5000// set to 500 by default Defines the
> maximum number of in progress operations the file store accepts before
> blocking on queuing new operations.
> filestore_fd_cache_random = true//  

No docs, I don't see this in my ancient cluster :-)

> journal_queue_max_ops = 100   //   set
> to 500 by default. Number of operations allowed in the journal queue
> filestore_omap_header_cache_size = 100  //   Determines
> the size of the LRU used to cache object omap headers. Larger values use
> more memory but may reduce lookups on omap.
> filestore_fd_cache_size = 100 //

You don't really need to set this so high, but not sure what the implications 
are if you go too high (it probably doesn't eat more memory until it opens so 
many files). If you have 4MB objects on a 1TB drive then you really only need 
250K to keep all files open.
> not in the ceph documentation. Seems to be a common tweak for SSD
> clusters though.
> max_open_files = 100 //
>  lets ceph set the max file descriptor in the OS to prevent running out
> of file descriptors

This is too low if you were really using all of the fd_cache. There are going 
to be thousands of TCP connections which need to be accounted for as well.
(In my experience there can be hundreds to thousands of TCP connections from just 
one RBD client and 200 OSDs, which is a lot.)


> osd_journal_size = 1   //
>journal max size for each OSD
> 
> New conf:
> 
> [global]
> fsid = *
> mon_initial_members = cephmon1
> mon_host = 
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = **
> cluster_network = **
> 
> You might notice, I have a few undocumented settings in the old
> configuration. These are settings I took from a certain openstack summit
> presentation and they may have contributed to this whole problem. Here's
> a list of settings that I think might be a possible cause for these
> speed issues:
> 
> filestore_fd_cache_random = true
> filestore_fd_cache_size = 100
> 
> Additionally, my colleague thinks these settings may have contributed :
> 
> filestore_queue_max_ops = 5000
> journal_queue_max_ops = 100
> 
> We will do further tests on these settings once we have our lab ceph
> test environment as we are also curious as to exactly what caused this.
> 
> 
> On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
>>> 
>>> Just to update the mailing list, we ended up going back to default
>>> ceph.conf without any additional settings than what is mandatory. We are
>>> now reaching speeds we never reached before, both in recovery and in
>>> regular usage. There was definitely something we set in the ceph.conf
>>> bogging everything down.
>> 
>> Could you please share the old and new ceph.conf, or the section that
>> was removed?
>> 
>> Best regards,
>> Alex
>> 
>>> 
>>> 
>>> On 2015-08-20 4:06 AM, Christian Balzer wrote:
 
 Hello,
 
 from all the pertinent points by Somnath, the one about pre-conditioning
 would be pretty high on my list, especially if this slowness persists a

Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread Shinobu Kinjo
> filestore_fd_cache_random = true

not true

Shinobu

On Fri, Aug 21, 2015 at 10:20 PM, Jan Schermer  wrote:

> Thanks for the config,
> few comments inline, not really related to the issue
>
> > On 21 Aug 2015, at 15:12, J-P Methot  wrote:
> >
> > Hi,
> >
> > First of all, we are sure that the return to the default configuration
> > fixed it. As soon as we restarted only one of the ceph nodes with the
> > default configuration, it sped up recovery tremendously. We had already
> > restarted before with the old conf and recovery was never that fast.
> >
> > Regarding the configuration, here's the old one with comments :
> >
> > [global]
> > fsid = *
> > mon_initial_members = cephmon1
> > mon_host = ***
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true   //
> >  Lets you use extended attributes (xattrs) of xfs/ext4/btrfs filesystems
>
> This actually did the opposite, but this option doesn't exist anymore
>
> > osd_pool_default_pgp_num = 450   //
> > default pgp number for new pools
> > osd_pg_bits = 12  //
> > number of bits used to designate pgps. Lets you have 2^12 pgps
>
> Could someone comment on those? What exactly does it do? What if I have
> more PGs than num_osds*osd_pg_bits?
>
> > osd_pool_default_size = 3   //
> > default copy number for new pools
> > osd_pool_default_pg_num = 450//
> > default pg number for new pools
> > public_network = *
> > cluster_network = ***
> > osd_pgp_bits = 12   //
> > number of bits used to designate pgps. Lets you have 2^12 pgps
> >
> > [osd]
> > filestore_queue_max_ops = 5000// set to 500 by default Defines the
> > maximum number of in progress operations the file store accepts before
> > blocking on queuing new operations.
> > filestore_fd_cache_random = true//  
>
> No docs, I don't see this in my ancient cluster :-)
>
> > journal_queue_max_ops = 100   //   set
> > to 500 by default. Number of operations allowed in the journal queue
> > filestore_omap_header_cache_size = 100  //   Determines
> > the size of the LRU used to cache object omap headers. Larger values use
> > more memory but may reduce lookups on omap.
> > filestore_fd_cache_size = 100 //
>
> You don't really need to set this so high, but not sure what the
> implications are if you go too high (it probably doesn't eat more memory
> until it opens so many files). If you have 4MB objects on a 1TB drive then
> you really only need 250K to keep all files open.
> > not in the ceph documentation. Seems to be a common tweak for SSD
> > clusters though.
> > max_open_files = 100 //
> >  lets ceph set the max file descriptor in the OS to prevent running out
> > of file descriptors
>
> This is too low if you were really using all of the fd_cache. There are
> going to be thousands of TCP connections which need to be accounted for as
> well.
> (In my experience there can be hundreds to thousands of TCP connections from
> just one RBD client and 200 OSDs, which is a lot.)
>
>
> > osd_journal_size = 1   //
> >journal max size for each OSD
> >
> > New conf:
> >
> > [global]
> > fsid = *
> > mon_initial_members = cephmon1
> > mon_host = 
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = **
> > cluster_network = **
> >
> > You might notice, I have a few undocumented settings in the old
> > configuration. These are settings I took from a certain openstack summit
> > presentation and they may have contributed to this whole problem. Here's
> > a list of settings that I think might be a possible cause for these
> > speed issues:
> >
> > filestore_fd_cache_random = true
> > filestore_fd_cache_size = 100
> >
> > Additionally, my colleague thinks these settings may have contributed :
> >
> > filestore_queue_max_ops = 5000
> > journal_queue_max_ops = 100
> >
> > We will do further tests on these settings once we have our lab ceph
> > test environment as we are also curious as to exactly what caused this.
> >
> >
> > On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
> >>>
> >>> Just to update the mailing list, we ended up going back to default
> >>> ceph.conf without any additional settings than what is mandatory. We
> are
> >>> now reaching speeds we never reached before, both in recovery and in
> >>> regular usage. There was definitely something we set in the ceph.conf
> >>> bogging everything down.
> >>
> >> Could you please share the old and new ceph.c

Re: [ceph-users] Rados: Undefined symbol error

2015-08-21 Thread Jason Dillaman
It sounds like you have the rados CLI tool from an earlier Ceph release (< Hammer) 
installed and it is attempting to use the librados shared library from a newer 
(>= Hammer) version of Ceph.

Jason 


- Original Message - 

> From: "Aakanksha Pudipeddi-SSI" 
> To: ceph-us...@ceph.com
> Sent: Thursday, August 20, 2015 11:47:26 PM
> Subject: [ceph-users] Rados: Undefined symbol error

> Hello,

> I cloned the master branch of Ceph and after setting up the cluster, when I
> tried to use the rados commands, I got this error:

> rados: symbol lookup error: rados: undefined symbol:
> _ZN5MutexC1ERKSsbbbP11CephContext

> I saw a similar post here: http://tracker.ceph.com/issues/12563 but I am not
> clear on the solution for this problem. I am not performing an upgrade here
> but the error seems to be similar. Could anybody shed more light on the
> issue and how to solve it? Thanks a lot!

> Aakanksha

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Samuel Just
Odd, did you happen to capture osd logs?
-Sam

On Thu, Aug 20, 2015 at 8:10 PM, Ilya Dryomov  wrote:
> On Fri, Aug 21, 2015 at 2:02 AM, Samuel Just  wrote:
>> What's supposed to happen is that the client transparently directs all
>> requests to the cache pool rather than the cold pool when there is a
>> cache pool.  If the kernel is sending requests to the cold pool,
>> that's probably where the bug is.  Odd.  It could also be a bug
>> specific to 'forward' mode either in the client or on the osd.  Why did
>> you have it in that mode?
>
> I think I reproduced this on today's master.
>
> Setup, cache mode is writeback:
>
> $ ./ceph osd pool create foo 12 12
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> pool 'foo' created
> $ ./ceph osd pool create foo-hot 12 12
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> pool 'foo-hot' created
> $ ./ceph osd tier add foo foo-hot
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> pool 'foo-hot' is now (or already was) a tier of 'foo'
> $ ./ceph osd tier cache-mode foo-hot writeback
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> set cache-mode for pool 'foo-hot' to writeback
> $ ./ceph osd tier set-overlay foo foo-hot
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> overlay for 'foo' is now (or already was) 'foo-hot'
>
> Create an image:
>
> $ ./rbd create --size 10M --image-format 2 foo/bar
> $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
> $ sudo mkfs.ext4 /mnt/bar
> $ sudo umount /mnt
>
> Create a snapshot, take md5sum:
>
> $ ./rbd snap create foo/bar@snap
> $ ./rbd export foo/bar /tmp/foo-1
> Exporting image: 100% complete...done.
> $ ./rbd export foo/bar@snap /tmp/snap-1
> Exporting image: 100% complete...done.
> $ md5sum /tmp/foo-1
> 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-1
> $ md5sum /tmp/snap-1
> 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-1
>
> Set the cache mode to forward and do a flush, hashes don't match - the
> snap is empty - we bang on the hot tier and don't get redirected to the
> cold tier, I suspect:
>
> $ ./ceph osd tier cache-mode foo-hot forward
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> set cache-mode for pool 'foo-hot' to forward
> $ ./rados -p foo-hot cache-flush-evict-all
> rbd_data.100a6b8b4567.0002
> rbd_id.bar
> rbd_directory
> rbd_header.100a6b8b4567
> bar.rbd
> rbd_data.100a6b8b4567.0001
> rbd_data.100a6b8b4567.
> $ ./rados -p foo-hot cache-flush-evict-all
> $ ./rbd export foo/bar /tmp/foo-2
> Exporting image: 100% complete...done.
> $ ./rbd export foo/bar@snap /tmp/snap-2
> Exporting image: 100% complete...done.
> $ md5sum /tmp/foo-2
> 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-2
> $ md5sum /tmp/snap-2
> f1c9645dbc14efddc7d8a322685f26eb  /tmp/snap-2
> $ od /tmp/snap-2
> 000 00 00 00 00 00 00 00 00
> *
> 5000
>
> Disable the cache tier and we are back to normal:
>
> $ ./ceph osd tier remove-overlay foo
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> there is now (or already was) no overlay for 'foo'
> $ ./rbd export foo/bar /tmp/foo-3
> Exporting image: 100% complete...done.
> $ ./rbd export foo/bar@snap /tmp/snap-3
> Exporting image: 100% complete...done.
> $ md5sum /tmp/foo-3
> 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-3
> $ md5sum /tmp/snap-3
> 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-3
>
> I first reproduced it with the kernel client, rbd export was just to
> take it out of the equation.
>
>
> Also, Igor sort of raised a question in his second message: if, after
> setting the cache mode to forward and doing a flush, I open an image
> (not a snapshot, so may not be related to the above) for write (e.g.
> with rbd-fuse), I get an rbd header object in the hot pool, even though
> it's in forward mode:
>
> $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
> $ sudo mount /mnt/bar /media
> $ sudo umount /media
> $ sudo umount /mnt
> $ ./rados -p foo-hot ls
> rbd_header.100a6b8b4567
> $ ./rados -p foo ls | grep rbd_header
> rbd_header.100a6b8b4567
>
> It's been a while since I looked into tiering; is that how it's
> supposed to work?  It looks like it happens because rbd_header op
> replies don't redirect?
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Ilya Dryomov
On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just  wrote:
> Odd, did you happen to capture osd logs?

No, but the reproducer is trivial to cut & paste.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Samuel Just
I think I found the bug -- need to whiteout the snapset (or decache
it) upon evict.

http://tracker.ceph.com/issues/12748
-Sam

On Fri, Aug 21, 2015 at 8:04 AM, Ilya Dryomov  wrote:
> On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just  wrote:
>> Odd, did you happen to capture osd logs?
>
> No, but the reproducer is trivial to cut & paste.
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw only delivers what's cached if latency between key request and actual download is above 90s

2015-08-21 Thread Sean
We heavily use radosgw here for most of our work and we have seen a 
weird truncation issue with radosgw/s3 requests.


We have noticed that if the time between the initial "ticket" to grab 
the object key and grabbing the data is greater than 90 seconds, the 
object returned is truncated to whatever RGW has grabbed/cached after 
the initial connection, and this seems to be around 512k.


Here is a PoC. This will work on most objects; I have tested mostly 1G 
to 5G keys in RGW::




#!/usr/bin/env python

import os
import sys
import json
import time

import boto
import boto.s3.connection

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Delayed download.')

    parser.add_argument('credentials', type=argparse.FileType('r'),
                        help='Credentials file.')

    parser.add_argument('endpoint')
    parser.add_argument('bucket')
    parser.add_argument('key')

    args = parser.parse_args()

    credentials = json.load(args.credentials)[args.endpoint]

    conn = boto.connect_s3(
        aws_access_key_id     = credentials.get('access_key'),
        aws_secret_access_key = credentials.get('secret_key'),
        host                  = credentials.get('host'),
        port                  = credentials.get('port'),
        is_secure             = credentials.get('is_secure', False),
        calling_format        = boto.s3.connection.OrdinaryCallingFormat(),
    )

    key = conn.get_bucket(args.bucket).get_key(args.key)

    key.BufferSize = 1048576
    key.open_read(headers={})
    time.sleep(120)

    key.get_contents_to_file(sys.stdout)



The format of the credentials file is just standard::

=
=
{
    "cluster": {
        "access_key": "blahblahblah",
        "secret_key": "blahblahblah",
        "host": "blahblahblah",
        "port": "443",
        "is_secure": true
    }
}

=
=


From here your object will almost always be truncated to whatever the 
gateway has cached in the time after the initial key request.


This can be a huge issue because, if the radosgw or cluster is under 
heavy load, some requests can take minutes. You can end up grabbing the 
rest of the object by doing a range request against the gateway, so I 
know the data is intact, but I don't think the radosgw should act as if 
the download completed successfully; it should instead return an error 
of some kind if it can no longer service the request.
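
For what it's worth, this is roughly how we pull the remainder with a 
ranged request once the plain download comes back short (same boto setup 
as the PoC above; the endpoint, bucket/key names and the offset below are 
just placeholders):

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id     = 'blahblahblah',
    aws_secret_access_key = 'blahblahblah',
    host                  = 'rgw.example.com',   # placeholder endpoint
    calling_format        = boto.s3.connection.OrdinaryCallingFormat(),
)

key = conn.get_bucket('mybucket').get_key('mykey')
already_have = 524288   # bytes actually received before the cut-off

# boto passes the Range header through to radosgw, so this asks for
# everything from the cut-off point onwards
with open('/tmp/rest-of-object', 'wb') as fp:
    key.get_contents_to_file(fp, headers={'Range': 'bytes=%d-' % already_have})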


We are using hammer (ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3)) and using civetweb as our 
gateway.


This is on a 3 node test cluster but I have tried on our larger cluster 
with the same behavior. If I can provide any other information please 
let me know.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Scottix
I saw this article on Linux Today and immediately thought of Ceph.

http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html

I was thinking: would it theoretically be possible with RGW to do a GET and
set a BEGIN_SEEK and OFFSET to retrieve only a specific portion of the
file?

The other option would be to append data to an RGW object instead of
rewriting the entire object.
And so on...

Just food for thought.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Gregory Farnum
On Fri, Aug 21, 2015 at 10:27 PM, Scottix  wrote:
> I saw this article on Linux Today and immediately thought of Ceph.
>
> http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html
>
> I was thinking would it theoretically be possible with RGW to do a GET and
> set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file.
>
> The other option to append data to a RGW object instead of rewriting the
> entire object.
> And so on...
>
> Just food for thought.

Raw RADOS (i.e., librados) users get access significantly more powerful
than what he's describing in that article. :) I don't know if anybody
will ever punch more of that functionality through RGW or not.
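
For anyone curious, a minimal sketch of that kind of access with the
Python rados bindings (the pool and object names here are made up, and
you need a ceph.conf/keyring that can reach the cluster):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')

ioctx.write_full('demo-object', b'hello')              # create/overwrite
ioctx.append('demo-object', b' world')                 # append without rewriting
print(ioctx.read('demo-object', length=5, offset=6))   # partial read -> 'world'

ioctx.close()
cluster.shutdown()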
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Robert LeBlanc

Shouldn't this already be possible with HTTP Range requests? I don't
work with RGW or S3 so please ignore me if I'm talking crazy.
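
e.g. something like this with boto looks like it should do a partial read
against RGW (untested sketch; endpoint, bucket and key are made up):

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='...', aws_secret_access_key='...',
    host='rgw.example.com',
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

key = conn.get_bucket('mybucket').get_key('mykey')
# ask the gateway for just the second megabyte of the object
chunk = key.get_contents_as_string(headers={'Range': 'bytes=1048576-2097151'})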
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Aug 21, 2015 at 3:27 PM, Scottix  wrote:
> I saw this article on Linux Today and immediately thought of Ceph.
>
> http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html
>
> I was thinking would it theoretically be possible with RGW to do a GET and
> set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file.
>
> The other option to append data to a RGW object instead of rewriting the
> entire object.
> And so on...
>
> Just food for thought.
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw only delivers what's cached if latency between key request and actual download is above 90s

2015-08-21 Thread Ben Hines
I just tried this (with some smaller objects, maybe 4.5 MB, as well as
with a 16 GB file) and it worked fine.

However, I am using the apache + fastcgi interface to rgw, rather than civetweb.

-Ben

On Fri, Aug 21, 2015 at 12:19 PM, Sean  wrote:
> We heavily use radosgw here for most of our work and we have seen a weird
> truncation issue with radosgw/s3 requests.
>
> We have noticed that if the time between the initial "ticket" to grab the
> object key and grabbing the data is greater than 90 seconds the object
> returned is truncated to whatever RGW has grabbed/cached after the initial
> connection and this seems to be around 512k.
>
> Here is some PoC. This will work on most objects I have tested mostly 1G to
> 5G keys in RGW::
>
> 
> 
> #!/usr/bin/env python
>
> import os
> import sys
> import json
> import time
>
> import boto
> import boto.s3.connection
>
> if __name__ == '__main__':
> import argparse
>
> parser = argparse.ArgumentParser(description='Delayed download.')
>
> parser.add_argument('credentials', type=argparse.FileType('r'),
> help='Credentials file.')
>
> parser.add_argument('endpoint')
> parser.add_argument('bucket')
> parser.add_argument('key')
>
> args = parser.parse_args()
>
> credentials= json.load(args.credentials)[args.endpoint]
>
> conn = boto.connect_s3(
> aws_access_key_id = credentials.get('access_key'),
> aws_secret_access_key = credentials.get('secret_key'),
> host  = credentials.get('host'),
> port  = credentials.get('port'),
> is_secure = credentials.get('is_secure',False),
> calling_format= boto.s3.connection.OrdinaryCallingFormat(),
> )
>
> key = conn.get_bucket(args.bucket).get_key(args.key)
>
> key.BufferSize = 1048576
> key.open_read(headers={})
> time.sleep(120)
>
> key.get_contents_to_file(sys.stdout)
> 
> 
>
> The format of the credentials file is just standard::
>
> =
> =
> {
>  "cluster": {
> "access_key": "blahblahblah",
> "secret_key": "blahblahblah",
> "host": "blahblahblah",
> "port": "443",
> "is_secure": true
> }
> }
>
> =
> =
>
>
> From here your object will almost always be truncated to whatever the
> gateway has cached in the time after the initial key request.
>
> This can be a huge issue as if the radosgw or cluster is tasked some
> requests can be minutes long. You can end up grabbing the rest of the object
> by doing a range request against the gateway so I know the data is intact
> but I don't think the radosgw should be acting as if the download is
> completed successfully and I think it should instead return an error of some
> kind if it can no longer service the request.
>
> We are using hammer (ceph version 0.94.2
> (5fb85614ca8f354284c713a2f9c610860720bbf3)) and using civetweb as our
> gateway.
>
> This is on a 3 node test cluster but I have tried on our larger cluster with
> the same behavior. If I can provide any other information please let me
> know.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD GHz vs. Cores Question

2015-08-21 Thread Robert LeBlanc

We are looking to purchase our next round of Ceph hardware and based
off the work by Nick Fisk [1] our previous thought of cores over clock
is being revisited.

I have two camps of thoughts and would like to get some feedback, even
if it is only theoretical. We currently have 12 disks per node (2
SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used
for journals and cache tier (when [2] and fstrim are resolved). We
also want to stay with a single processor for cost, power and NUMA
considerations.

1. For 12 disks with three threads each (2 client and 1 background),
lots of slower cores would allow I/O (ceph code) to be scheduled as
soon as a core is available.

2. Faster cores would get through the Ceph code faster but there would
be fewer cores and so some I/O may have to wait to be scheduled.

I'm leaning towards #2 for these reasons, please expose anything I may
be missing:
* The latency will only really be improved in the SSD I/O with faster
clock speed, i.e. all writes and any reads from the cache tier. So 8 fast
cores might be sufficient; reading from spindle and flushing the
journal will have a "substantial" amount of sleep to allow other Ceph
I/O to be hyperthreaded.
* Even though SSDs are much faster than spindles they are still orders
of magnitude slower than the processor, so it is still possible to get
more lines of code executed between SSD I/O with a faster processor
even with fewer cores.
* As the Ceph code is improved through optimization and less code has
to be executed for each I/O, faster clock speeds will only provide
even more benefit (lower latency, less waiting for cores) as the delay
shifts more from CPU to disk.

Since our workload is typically small I/O (12K-18K), latency means a lot
to our performance.

Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz

[1] http://www.spinics.net/lists/ceph-users/msg19305.html
[2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD GHz vs. Cores Question

2015-08-21 Thread Mark Nelson
FWIW, we recently were looking at a couple of different options for the 
machines in our test lab that run the nightly QA suite jobs via teuthology.


From a cost/benefit perspective, I think it really comes down to 
something like a XEON E3-12XXv3 or the new XEON D-1540, each of which 
has advantages and disadvantages.


We were very tempted by the Xeon D but it was still just a little too 
new for us so we ended up going with servers using more standard E3 
processors.  The Xeon D setup was slightly cheaper, offers more 
theoretical performance, and is way lower power, but at a much slower 
per-core clock speed.  It's likely that for our functional tests 
clock speed may be more important than core count (but on these machines 
we'll only have 4 OSDs per server).


Anyway, I suspect that either setup will probably work fairly well for 
spinners.  SSDs get trickier.


Mark

On 08/21/2015 05:46 PM, Robert LeBlanc wrote:


We are looking to purchase our next round of Ceph hardware and based
off the work by Nick Fisk [1] our previous thought of cores over clock
is being revisited.

I have two camps of thoughts and would like to get some feedback, even
if it is only theoretical. We currently have 12 disks per node (2
SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used
for journals and cache tier (when [2] and fstrim are resolved). We
also want to stay with a single processor for cost, power and NUMA
considerations.

1. For 12 disks with three threads each (2 client and 1 background),
lots of slower cores would allow I/O (ceph code) to be scheduled as
soon as a core is available.

2. Faster cores would get through the Ceph code faster but there would
be less cores and so some I/O may have to wait to be scheduled.

I'm leaning towards #2 for these reasons, please expose anything I may
be missing:
* The latency will only really be improved in the SSD I/O with faster
clock speed, all writes and any reads from the cache tier. So 8 fast
cores might be sufficient, reading from spindle and flushing the
journal will have a "substantial" amount of sleep to allow other Ceph
I/O to be hyperthreaded.
* Even though SSDs are much faster than spindles they are still orders
of magnitude slower than the processor, so it is still possible to get
more lines of code executed between SSD I/O with a faster processor
even with less cores.
* As the Ceph code is improved through optimization and less code has
to be executed for each I/O, faster clock speeds will only provide
even more benefit (lower latency, less waiting for cores) as the delay
shifts more from CPU to disk.

Since our workload is typically small I/O 12K-18K, latency means a lot
to our performance.

Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz

[1] http://www.spinics.net/lists/ceph-users/msg19305.html
[2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about reliability model result

2015-08-21 Thread dahan
Hi,
I have cross-posted this issue here and on GitHub,
but no response yet.

Any advice?

On Mon, Aug 10, 2015 at 10:21 AM, dahan  wrote:

>
> Hi all, I have tried the reliability model:
> https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> I run the tool with default configuration, and cannot understand the
> result.
>
> ```
> storage              durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
> -------------------  ----------  ----------  ----------  ----------  ----------  ---------
> Disk: Enterprise     99.119%     0.000e+00   0.721457%   0.159744%   0.000e+00   8.812e+12
> RADOS: 1 cp          99.279%     0.000e+00   0.721457%   0.000865%   0.000e+00   5.411e+12
> RADOS: 2 cp          7-nines     0.000e+00   0.49%       0.003442%   0.000e+00   9.704e+06
> RADOS: 3 cp          11-nines    0.000e+00   5.090e-11   3.541e-09   0.000e+00   6.655e+02
> ```
>
> ```
> storage              durability  PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
> -------------------  ----------  ----------  ----------  ----------  ----------  ---------
> Site (1 PB)          99.900%     0.099950%   0.000e+00   0.000e+00   0.000e+00   9.995e+11
> RADOS: 1-site, 1-cp  99.179%     0.099950%   0.721457%   0.000865%   0.000e+00   1.010e+12
> RADOS: 1-site, 2-cp  99.900%     0.099950%   0.49%       0.003442%   0.000e+00   9.995e+11
> RADOS: 1-site, 3-cp  99.900%     0.099950%   5.090e-11   3.541e-09   0.000e+00   9.995e+11
>
> ```
>
> The two result tables have different trends. In the first table, the
> durability value is 1 cp < 2 cp < 3 cp. However, the second table results
> in 1 cp < 2 cp = 3 cp.
>
> The two tables have the same PL(copies), PL(NRE), and PL(rep); the only
> difference is PL(site). PL(site) is constant, since the number of sites is
> constant. The trend should be the same.
>
> How to explain the result?
>
> Anything I missed out? Thanks
>
>
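
If it helps with interpreting the output: the way I read the durability
column, it is roughly one minus the sum of the loss-probability columns,
and the fully printed rows above seem to match that to within rounding:

durability ~= 1 - [ PL(site) + PL(copies) + PL(NRE) + PL(rep) ]

Disk: Enterprise    : 1 - (0 + 0.721457% + 0.159744% + 0) ~= 99.119%
RADOS: 1 cp         : 1 - (0 + 0.721457% + 0.000865% + 0) ~= 99.278%
RADOS: 1-site, 1-cp : 1 - (0.099950% + 0.721457% + 0.000865% + 0) ~= 99.178%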
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] TRIM / DISCARD run at low priority by the OSDs?

2015-08-21 Thread Chad William Seys
Hi All,

Is it possible to give TRIM / DISCARD initiated by krbd low priority on the 
OSDs?

I know it is possible to run fstrim at Idle priority on the rbd mount point, 
e.g. ionice -c Idle fstrim -v $MOUNT .  

But this Idle priority (it appears) only applies within the context of the node 
executing fstrim.  Even when the node executing fstrim is idle, the OSDs are 
very busy and performance suffers.

Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at 
low priority also?

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com