[ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Jan Schermer
Hi,
I have a full-SSD cluster on my hands, currently running Dumpling, with plans 
to upgrade soon, and OpenStack with RBD on top of that. While I am overall 
quite happy with the performance (it scales well across clients), there is one 
area where it really fails badly - big database workloads.

Typically, what a well-behaved database does is commit every transaction to disk 
before confirming it, so on a “typical” cluster with a write latency of 5ms 
(with SSD journal) the maximum number of transactions per second for a single 
client is 200 (more likely around 100, depending on the filesystem). 
Now, that’s not _too_ bad when running hundreds of small databases, but it’s 
nowhere near the performance required to substitute an existing SAN or even just 
a simple RAID array with writeback cache.
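For reference, the latency figure above comes from a simple single-threaded 
O_DIRECT/fsync test along these lines (filename and size are just examples):

fio --name=synclatency --filename=/mnt/test/fio.tmp --size=1G \
    --rw=write --bs=4k --ioengine=sync --direct=1 --fsync=1 \
    --numjobs=1 --runtime=60 --time_based

One fsync per 4k write is roughly what a database txlog does, so the resulting 
IOPS is an upper bound on transactions per second for a single client.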

My first hope was that enabling the RBD cache would help - but it really doesn’t, 
because all the flushes (O_DIRECT writes) end up on the drives and not in the 
cache. Disabling barriers in the client helps, but that makes it not crash 
consistent (unless one uses ext4 with journal_checksum etc.; I am going to test 
that soon).
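To be concrete about the barrier option - ext4 example, device and mountpoint 
are made up:

mount -o nobarrier /dev/vdb /data                    # fast, but not crash consistent on its own
mount -o nobarrier,journal_checksum /dev/vdb /data   # the combination I still want to test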

Are there any plans to change this behaviour - i.e. make the cache a real 
writeback cache?
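For context, these are the RBD cache knobs I know of (values should be the 
defaults, from memory) - and none of them, as far as I can tell, keeps an 
fsync/flush from going all the way to the OSDs:

[client]
rbd cache = true
rbd cache size = 33554432                  # 32 MB
rbd cache max dirty = 25165824             # 24 MB
rbd cache writethrough until flush = true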

I know there are good reasons not to do this, and I commend the developers for 
designing the cache this way, but real-world workloads demand shortcuts from 
time to time - for example MySQL with its InnoDB engine has an option to only 
commit to disk every Nth transaction - and this is exactly the kind of thing 
I’m looking for. Not having every confirmed transaction/write on the disk is 
not a huge problem; having a b0rked filesystem is, so this should be safe as 
long as I/O order is preserved. Sadly, my database is not InnoDB where I can 
tune something, but an enterprise behemoth that traditionally runs on FC 
arrays; it has no parallelism (that I could find) and always uses O_DIRECT for 
the txlog.
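The MySQL knob I have in mind is, from memory, along these lines - strictly it 
trades a sync per commit for a sync roughly once per second rather than 
literally every Nth transaction, but the idea is the same:

[mysqld]
innodb_flush_log_at_trx_commit = 2    # write the log at commit, fsync ~once per second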

(For the record - while the array is able to swallow 30K IOPS for a minute, 
once the cache is full it slows to ~3 IOPS, while Ceph happily gives the same 
200 IOPS forever. The bottom line is you always need more disks or more cache, 
and your workload should always be able to run without the cache anyway - even 
enterprise arrays fail, and write cache is not always available, contrary to 
popular belief.)

Is there some option that we could use right now to turn on true writeback 
caching? Losing a few transactions is fine as long as ordering is preserved.
I was thinking “cache=unsafe”, but I have no idea whether I/O order is preserved 
with that.
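What I mean is something like this on the qemu command line (pool/volume/user 
names are made up, and again I don't know what ordering guarantees this gives):

-drive file=rbd:volumes/myvolume:id=cinder,format=raw,if=virtio,cache=unsafe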
I already mentioned turning off barriers, which could be safe in some setups 
but needs testing.
Upgrading from Dumpling will probably help with scaling, but will it help write 
latency? I would need to get from 5ms/write to <1ms/write.
I investigated guest-side caching (EnhanceIO/flashcache), but that fails really 
badly when the guest or host crashes - lots of corruption. EnhanceIO in 
particular looked very nice and claims to respect barriers… not in my 
experience, though.

It might seem that what I want is evil, and it really is if you’re running a 
banking database, but for most people this is exactly what is missing to make 
their workloads run without some sort of 80s SAN system in their datacentre. 
I think everyone here would appreciate that :-)

Thanks

Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-25 Thread Yan, Zheng
The kernel client bug should be fixed by
https://github.com/ceph/ceph-client/commit/72f22efb658e6f9e126b2b0fcb065f66ffd02239
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replacing OSD disks with SSD journal - journal disk space use

2015-05-25 Thread Eneko Lacunza

Hi all,

We have a firefly ceph cluster (using Proxmox VE, but I don't think this 
is relevant), and found that an OSD disk was showing quite a high number of 
errors as reported by SMART, and also quite high wait times as reported 
by munin, so we decided to replace it.


What I did was mark the OSD down/out, then remove it (removing its 
partitions), replace the disk, and create a new OSD, which was created 
with the same ID as the removed one (as I was hoping, so the CRUSH map 
would not change).
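For reference, the sequence was roughly the standard one from the docs - 
something like this, with osd.1 just as an example ID, not necessarily 
exactly what I ran:

ceph osd out 1
service ceph stop osd.1
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1
# replace the physical disk, then re-create the OSD:
ceph-disk prepare /dev/sdc /dev/sdb    # new data disk + journal SSD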


So everything has worked as expected, except one minor non-issue:
- The original OSD journal was on a separate SSD disk, which had partitions 
#1 and #2 (journals of 2 OSDs).

- The original journal partition (#1) was removed.
- A new partition has been created as #1, but it has been assigned space 
after the last existing partition. So there is now a 5GB hole at the 
beginning of the SSD disk. Proxmox is using ceph-disk prepare for this; I 
saw in the docs (http://ceph.com/docs/master/man/8/ceph-disk/) that 
ceph-disk prepare creates a new partition in the journal block device.


What I'm afraid of is that, given enough OSD replacements, Proxmox won't 
find free space for new journals on that SSD disk, even though there would 
be plenty at the beginning?


Maybe the journal-partition creation can be improved so that it also 
detects free space at the beginning of the disk and between existing partitions?
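As a manual workaround, I suppose one could create the journal partition 
explicitly in the gap and hand the existing partition to ceph-disk - untested 
on my side, device names are just examples:

parted /dev/sdb unit s print free      # shows the ~5GB gap at the start of the SSD
sgdisk --new=1:2048:+5G /dev/sdb       # re-create partition #1 in that gap
ceph-disk prepare /dev/sdc /dev/sdb1   # pass the existing partition as the journal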


Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Nick Fisk
Hi Jan,

I share your frustrations with slow sync writes. I'm exporting RBDs via iSCSI 
to ESX, which seems to do most operations as 64k sync IOs. You can do a fio 
run and impress yourself with the numbers that you can get out of the cluster, 
but this doesn't translate into what you can achieve when using sync writes 
with a client.

I too have been experimenting with flashcache/EnhanceIO, with the goal of using 
dual-port SAS SSDs to allow for HA iSCSI gateways. Currently I'm just testing 
with a single iSCSI server and seeing a massive improvement. I'm interested in the 
corruptions you have been experiencing on host crashes - are you implying that 
you think flashcache is buffering writes before submitting them to the SSD? 
When watching its behaviour using iostat, it looks like it submits everything as 
4k IOs to the SSD, which to me looks like it is not buffering.
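For reference, I'm just eyeballing it with something like this (device name is 
an example):

iostat -xm 1 /dev/sdX    # avgrq-sz staying around 8 sectors would mean 4k writes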

I did raise a topic a few months back asking about the possibility of librbd 
supporting persistent caching to SSDs, which would allow writeback caching 
regardless of whether the client requests a flush. Although there was some 
interest in the idea, I didn't get the feeling it would be at the top of 
anyone's priority list.

Nick


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-05-25 Thread Jan Schermer
Hi Nick,

flashcache doesn’t support barriers, so I haven’t even considered it. I used it a 
few years ago to speed up some workloads out of curiosity and it worked well, 
but I can’t use it to cache this kind of workload.

EnhanceIO passed my initial testing (although the documentation is very sketchy 
and the project is abandoned AFAIK), and is supposed to respect barriers/flushes. 
I was only interested in a “volatile cache” scenario - create a ramdisk in the 
guest (for example 1GB) and use it to cache the virtual block device (and of 
course flush and remove it before rebooting). All worked pretty well during my 
testing with fio & stuff until I ran the actual workload - in my case a DB2 9.7 
database. It took just minutes for the kernel to panic (I can share a 
screenshot if you’d like). So it was not a host failure but a guest failure, and 
it managed to fail on two fronts - stability and crash consistency - at the 
same time. The filesystem was completely broken afterwards - while it could be 
mounted “cleanly” (the journal appeared consistent), there was massive damage to 
the files. I expected the open files to be zeroed or missing or damaged, but it 
did very random damage all over the place, including binaries in /bin, manpage 
files and so on - things that nobody was even touching. Scary.

I don’t really understand your question about flashcache - do you run it in 
writeback mode? It’s been years since I used it so I won’t be much help here - 
I disregarded it as unsafe right away because of barriers and wouldn’t use it 
in production.

I don’t think a persistent cache is something to do right now - it would be 
overly complex to implement, it would limit migration, and it can be done on 
the guest side with (for example) bcache if really needed - you can always 
expose a local LVM volume to the guest and use it for caching (and that’s 
something I might end up doing; rough sketch below) with mostly the same effect.
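If I go that route, the guest-side setup would be something like this - 
assuming the RBD shows up as /dev/vda and the local LVM volume as /dev/vdb, 
untested, and writeback mode obviously puts all the trust in the local volume:

make-bcache -B /dev/vda     # RBD-backed disk becomes the backing device
make-bcache -C /dev/vdb     # local LVM-backed volume becomes the cache
# attach the cache set (uuid from bcache-super-show /dev/vdb):
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode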
For most people (and that’s my educated guess) the only needed features are 
that it is fast(-er) and that it comes up again after a crash without having 
to be restored from backup - and that could be just a slight modification to 
the existing RBD cache - just don’t flush it on every fsync() but maintain 
ordering - and it’s done? I imagine some ordering is there already, it must be 
flushed when the guest is migrated, and it’s production-grade and not just some 
hackish attempt. It just doesn’t really cache the stuff that matters most in my 
scenario…

I wonder if cache=unsafe does what I want, but it’s hard to test that 
assumption unless something catastrophic happens like it did with EnhanceIO…

Jan


[ceph-users] radosgw load/performance/crashing

2015-05-25 Thread Daniel Hoffman
Hi All.

We are trying to cope with radosgw crashing every 5-15 minutes. This seems to
be getting worse and worse, but we are unable to determine the cause -
nothing in the logs, as it appears to be a radosgw hang.

The port is open and accepts a connection, but there is no response to a
HEAD/GET etc.

We are unsure where to go from here.

We have HAProxy running on a dual 10G connected server. It is also doing
SSL offload for the gateways.

The gateways are civetweb. We run obj01/02 on physical hardware. We have
attempted to run 4 instances on the same machine; the machine can cope, but
the instances still crash too.

We are running 0.94-1337-gce175f3-1 which is
https://github.com/ceph/ceph/tree/wip-rgw-orphans/src/rgw

Attached is the data via the load balancer for the last week. As you can
see it's close to 500-900MB/s at most times.

[client.radosgw.ceph-obj02]
  host = ceph-obj02
  keyring = /etc/ceph/keyring.radosgw.ceph-obj02
  rgw socket path = /tmp/radosgw.sock
  log file = /var/log/ceph/radosgw.log
  rgw data = /var/lib/ceph/radosgw/ceph-obj02
  rgw thread pool size = 1024
  rgw print continue = False
  rgw enable ops log = False
  log to stderr = False
  rgw enable usage log = False
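
For what it's worth, we can crank up logging on one gateway to try to catch
the hang - presumably something like this in the same section (debug levels
are a guess):

  debug rgw = 20
  debug ms = 1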

Anyone have any thoughts? Is this just a pure capacity/performance issue
with civetweb, and do I just need to run more threads/gateways?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multi-Object delete and RadosGW

2015-05-25 Thread Daniel Hoffman
Has anyone come across a problem with multi-object deletes?

We have a number of systems that we think are sending big piles of
POST/XML multi-object deletes.

Has anyone had any experience with this locking up civetweb or
apache/fastcgi threads? Are there any tunable settings we could use?
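
The only knobs we can think of are along these lines (not sure these are
even the right ones; port is an example):

rgw thread pool size = 1024
rgw frontends = civetweb port=7480 num_threads=1024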

Thanks

Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance and CPU load on HP servers running ceph (DL380 G6, should apply to others too)

2015-05-25 Thread Tuomas Juntunen
Hi

 

I wanted to share my findings of running ceph on HP servers.

 

We had a lot of problems with CPU load, which was sometimes even 800. We
were trying to figure out why this happens even while not doing anything
special.

 

Our OSD nodes are running DL380 G6s with dual quad-core CPUs and 32 GB of
memory.

 

The solution we found to work was to apply the following settings in the BIOS:

 

HP Power Profile Mode: Maximum Performance

Power Regulator Mode: Static High Performance

Intel(R) Turbo Boost Technology: Disabled

 

With these settings our loads never go over 20 and there are no "hangs" in
writes or reads at any time.

 

If anyone else has any experience with these settings, I would appreciate
hearing about your findings. Turbo Boost is, I would assume, the biggest
factor here: when the CPU frequency is adjusted, the CPUs "hang" for a while
to do the adjustment, and when the adjustment happens a lot, it creates this
high load.
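
On the OS side, pinning the frequency governor might achieve a similar
effect - we have not verified this ourselves:

cpupower frequency-set -g performance
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # should now say "performance"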

 

Br,

Tuomas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked requests/ops?

2015-05-25 Thread Xavier Serrano
Hello,

We have observed that our cluster is often moving back and forth
from HEALTH_OK to HEALTH_WARN states due to "blocked requests".
We have also observed "blocked ops". For instance:

# ceph status
cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
 health HEALTH_WARN
1 requests are blocked > 32 sec
 monmap e5: 5 mons at 
{ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
election epoch 44, quorum 0,1,2,3,4 
ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
 osdmap e5091: 120 osds: 100 up, 100 in
  pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
13164 GB used, 168 TB / 181 TB avail
2048 active+clean
  client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s

# ceph health detail
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests


My questions are:
(1) Is it normal to have "slow requests" in a cluster?
(2) Or is it a symptom that indicates that something is wrong?
(for example, a disk is about to fail)
(3) How can we fix the "slow requests"?
(4) What's the meaning of "blocked ops", and how can they be
blocked so long? (67000 seconds is more than 18 hours!)
(5) How can we fix the "blocked ops"?


Thank you very much for your help.

Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests/ops?

2015-05-25 Thread Christian Balzer

Hello,

Firstly, find my "Unexplainable slow request" thread in the ML archives
and read all of it.

On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:

> Hello,
> 
> We have observed that our cluster is often moving back and forth
> from HEALTH_OK to HEALTH_WARN states due to "blocked requests".
> We have also observed "blocked ops". For instance:
> 
As always, SW versions and a detailed HW description (down to the model of
HDDs used) would be helpful and educational.

> # ceph status
> cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
>  health HEALTH_WARN
> 1 requests are blocked > 32 sec
>  monmap e5: 5 mons at
> {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
> election epoch 44, quorum 0,1,2,3,4
> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5 osdmap
> e5091: 120 osds: 100 up, 100 in pgmap v473436: 2048 pgs, 2 pools, 4373
> GB data, 1093 kobjects 13164 GB used, 168 TB / 181 TB avail 2048
> active+clean client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s
> 
> # ceph health detail
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
> 1 ops are blocked > 67108.9 sec
> 1 ops are blocked > 67108.9 sec on osd.71
> 1 osds have slow requests
> 
You will want to have a very close look at osd.71 (logs, internal
counters, cranking up debugging), but might find it just as mysterious as
my case in the thread mentioned above.
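
The admin socket on the node hosting osd.71 is the place to start, something
along these lines (adjust IDs to your setup):

ceph daemon osd.71 dump_ops_in_flight     # run on the host where osd.71 lives
ceph daemon osd.71 dump_historic_ops
ceph tell osd.71 injectargs '--debug_osd 10 --debug_ms 1'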

> 
> My questions are:
> (1) Is it normal to have "slow requests" in a cluster?
Not really, though the Ceph developers clearly think those merit just a
WARNING level, whereas I would consider them a clear sign of brokenness,
as VMs or other clients with those requests pending are likely to be
unusable at that point.

> (2) Or is it a symptom that indicates that something is wrong?
> (for example, a disk is about to fail)
That. Of course, your cluster could just be at the edge of its performance
and nothing but improving that (most likely by adding more nodes/OSDs)
would fix it.

> (3) How can we fix the "slow requests"?
Depends on the cause, of course.
AFTER you have exhausted all means and gathered all relevant log/performance data
from osd.71, restarting the OSD might be all that's needed.

> (4) What's the meaning of "blocked ops", and how can they be
> blocked so long? (67000 seconds is more than 18 hours!)
Precisely, this shouldn't happen.

> (5) How can we fix the "blocked ops"?
> 
AFTER you have exhausted all means and gathered all relevant log/performance data
from osd.71, restarting the OSD might be all that's needed.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com