[ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM

2014-06-12 Thread Ke-fei Lin
Hi list, I deployed a Windows 7 VM with a qemu-rbd disk and got unexpected performance during the boot phase. When booting the Windows VM, for roughly two consecutive minutes `ceph -w` shows lines like: "... 567 KB/s rd, 567 op/s", "... 789 KB/s rd, 789 op/s" and
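
(The read rate in KB/s matching the op/s count suggests a stream of ~1 KB reads. A common mitigation, sketched below on the assumption that the guest disk goes through librbd and that client-side caching is acceptable, is to enable the RBD cache; note that with qemu the drive's cache= setting generally takes precedence.)

    # /etc/ceph/ceph.conf on the hypervisor (librbd client side) -- example values
    [client]
        rbd cache = true
        rbd cache size = 33554432        # 32 MB
        rbd cache max dirty = 25165824   # must stay below rbd cache size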

Re: [ceph-users] Fixing inconsistent placement groups

2014-06-12 Thread Gregory Farnum
The OSD should have logged the identities of the inconsistent objects to the central log on the monitors, as well as to its own local log file. You'll need to identify for yourself which version is correct, which will probably involve going and looking at them inside each OSD's data store. If the p
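
(A minimal sketch of that workflow; the PG id, paths and OSD ids below are placeholders, not taken from this thread.)

    # 1. find the inconsistent PG(s)
    ceph health detail | grep inconsistent
    # 2. see which objects scrub flagged, in the central log and the OSDs' own logs
    grep ERR /var/log/ceph/ceph-osd.*.log
    # 3. compare the copies inside each replica's data store, e.g.
    #    /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head/
    # 4. once you know the primary holds the good copy, trigger the repair
    ceph pg repair <pgid>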

[ceph-users] Fixing inconsistent placement groups

2014-06-12 Thread Aaron Ten Clay
I'm having trouble finding a concise set of steps to repair inconsistent placement groups. I know from other threads that issuing a 'ceph pg repair ...' command could cause loss of data integrity if the primary OSD happens to have the bad copy of the placement group. I know how to find which PG's a

Re: [ceph-users] What exactly is the kernel rbd on osd issue?

2014-06-12 Thread David Zafman
This was commented on recently on ceph-users, but I’ll explain the scenario. If the single kernel needs to flush rbd blocks to reclaim memory and the OSD process needs memory to handle the flushes, you end up deadlocked. If you run the rbd client in a VM with dedicated memory allocation from th

[ceph-users] What exactly is the kernel rbd on osd issue?

2014-06-12 Thread lists+ceph
I remember reading somewhere that the kernel ceph clients (rbd/fs) could not run on the same host as the OSD. I tried finding where I saw that, and could only come up with some irc chat logs. The issue stated there is that there can be some kind of deadlock. Is this true, and if so, would you ha

Re: [ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Sage Weil
On Fri, 13 Jun 2014, Charles 'Boyo wrote: > Aha! Thanks Sage. > > I completely get it now. So I can use a ramdisk provided it is always > flushed to disk during shutdowns and I never have unplanned outages > right? i.e., never! :) > Does this hard OSD consistency also explain why eventual con

Re: [ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Charles 'Boyo
Aha! Thanks Sage. I completely get it now. So I can use a ramdisk provided it is always flushed to disk during shutdowns and I never have unplanned outages right? Does this hard OSD consistency also explain why eventual consistency at the RADOS level was "designed" out? Having all OSDs in a rep

Re: [ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Sage Weil
On Fri, 13 Jun 2014, Charles 'Boyo wrote: > Hello Sage. > > I'm running xfs and crashes are rare enough. When they do happen, I > would rather just rebuild the entire cluster than bother with fsck > anyway. I mean any unclean/abrupt shutdown of ceph-osd, not an XFS error. Like a power failu

Re: [ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Charles 'Boyo
Hello Sage. I'm running xfs and crashes are rare enough. When they do happen, I would rather just rebuild the entire cluster than bother with fsck anyway. So can you show me how to turn off journalling using the xfs FileStore backend? :) Charles --Original Message-- From: Sage Weil T

Re: [ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Sage Weil
On Thu, 12 Jun 2014, Charles 'Boyo wrote: > Hello list. > > Is it possible, or will it ever be possible to disable the OSD's > journalling activity? > > I understand it is risky and has the potential for data loss but in my > use case, the data is easily re-built from scratch and I'm really >

[ceph-users] Disabling OSD journals, parallel reads and eventual consistency for RBD

2014-06-12 Thread Charles 'Boyo
Hello list. Is it possible, or will it ever be possible to disable the OSD's journalling activity? I understand it is risky and has the potential for data loss but in my use case, the data is easily re-built from scratch and I'm really bothered about the reduced throughput "wasted" on journall
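
(Not from this thread, but for context: with FileStore on xfs there is no supported way to simply switch the journal off; what usually gets tuned instead is where the journal lives and how big it is, e.g.:)

    # ceph.conf, [osd] section -- example values only
    [osd]
        osd journal = /var/lib/ceph/osd/$cluster-$id/journal   # point this at a fast SSD partition instead
        osd journal size = 10240    # MB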

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Xu (Simon) Chen
We actually disabled swap altogether on these machines... On Thu, Jun 12, 2014 at 5:06 PM, Gregory Farnum wrote: > To be clear, that's the solution to one of the causes of this issue. > The log message is very general, and just means that a disk access > thread has been gone for a long time (

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Gregory Farnum
To be clear, that's the solution to one of the causes of this issue. The log message is very general, and just means that a disk access thread has been gone for a long time (15 seconds, in this case) without checking in (so usually, it's been inside of a read/write syscall for >=15 seconds). Other

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Mark Nelson
Can you check and see if swap is being used on your OSD servers when this happens, and even better, use something like collectl or another tool to look for major page faults? If you see anything like this, you may want to tweak swappiness to be lower (say 10). Mark On 06/12/2014 03:17 PM, X
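
(Generic commands for that check, assuming a standard Linux host; not taken from the thread.)

    vmstat 1        # si/so columns show swap-in/swap-out activity
    sar -B 1        # majflt/s column shows major page faults
    # if swap is being touched, lower swappiness as suggested above
    sysctl -w vm.swappiness=10
    echo 'vm.swappiness = 10' >> /etc/sysctl.conf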

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Xu (Simon) Chen
I've done some more tracing. It looks like the high IO wait in VMs is somewhat correlated with some OSDs having high in-flight ops (ceph admin socket, dump_ops_in_flight). When in_flight_ops is high, I see something like this in the OSD log: 2014-06-12 19:57:24.572338 7f4db6bdf700 1 heartbeat_map r
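
(For reference, the admin-socket query looks like this; the OSD id and socket path are examples.)

    ceph daemon osd.12 dump_ops_in_flight
    # or via the socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight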

Re: [ceph-users] HEALTH_WARN pool has too few pgs

2014-06-12 Thread Eric Eastman
Hi JC, The cluster already has 1024 PGs on only 15 OSDs, which is above the formula of (100 x #OSDs)/size. How large should I make it? # ceph osd dump | grep Ray pool 17 'Ray' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 7785 owner 0 f
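
(For these numbers the formula gives 100 x 15 / 3 = 500, so 1024 PGs already clears it; the "too few pgs" warning is more likely driven by this pool holding far more objects per PG than the cluster average. Raising pg_num anyway would look like the following, with 2048 as a purely illustrative target; note that pg_num can only be increased, never decreased.)

    ceph osd pool set Ray pg_num 2048
    ceph osd pool set Ray pgp_num 2048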

Re: [ceph-users] bootstrap-mds, bootstrap-osd and admin keyring not found

2014-06-12 Thread Shayan Saeed
Hi, I am following the standard deployment guide for ceph firefly. When I try to do step 5 to collect the keys, it gives me warnings saying that keyrings were not found for bootstrap-mds, bootstrap-osd and admin, because of which the next step of deploying OSDs fails. Other people on this forum have

[ceph-users] bootstrap-mds, bootstrap-osd and admin keyring not found

2014-06-12 Thread Shayan Saeed
Hi, I am following the standard deployment guide for ceph firefly. When I try to do step 5 to collect the keys, it gives me warnings saying that keyrings were not found for bootstrap-mds, bootstrap-osd and admin, because of which the next step of deploying OSDs fails. Other people on this forum have
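
(The sequence that normally generates and collects those keyrings, with 'mon1' standing in for the actual monitor hostname:)

    ceph-deploy new mon1
    ceph-deploy mon create-initial     # creates the monitor(s) and gathers the keys
    ceph-deploy gatherkeys mon1        # only needed if create-initial did not collect them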

Re: [ceph-users] [ceph] OSD priority / client localization

2014-06-12 Thread Gregory Farnum
You can set up pools which have all their primaries in one data center, and point the clients at those pools. But writes will still have to traverse the network link because Ceph does synchronous replication for strong consistency. If you want them to both write to the same pool, but use local OSD
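
(A sketch of such a "local primaries" CRUSH rule, assuming buckets named dc1 and dc2 exist in the CRUSH map; the names and ruleset number are illustrative.)

    rule dc1-primary {
        ruleset 5
        type replicated
        min_size 1
        max_size 10
        step take dc1
        step chooseleaf firstn 1 type host     # primary copy comes from dc1
        step emit
        step take dc2
        step chooseleaf firstn -1 type host    # remaining replicas come from dc2
        step emit
    }
    # then: ceph osd pool set <pool> crush_ruleset 5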

Re: [ceph-users] error (24) Too many open files

2014-06-12 Thread Eliezer Croitoru
ulimit -Sa ulimit -Ha Which will show you your limits. If you are hitting this limit and it's 16k, I would say that the server is not tuned for your needs; raise it. If it's more than that, but not reaching 1 million or any other very high number, I would say use "lsof -n|wc -l" to get some sta

Re: [ceph-users] Moving Ceph cluster to different network segment

2014-06-12 Thread John Wilkins
Fred, I'm not sure it will completely answer your question, but I would definitely have a look at: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address There are some important steps in there for monitors. On Wed, Jun 11, 2014 at 12:08 PM, Fred Yang wrot
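
(The "messy" manual path described there boils down to editing the monmap; monitor id 'a' and the address below are placeholders.)

    ceph mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap
    monmaptool --rm a /tmp/monmap
    monmaptool --add a 10.0.0.1:6789 /tmp/monmap
    # stop the monitor, inject the edited map, then start it again
    ceph-mon -i a --inject-monmap /tmp/monmap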

[ceph-users] error (24) Too many open files

2014-06-12 Thread Gregory Farnum
You probably just want to increase the ulimit settings. You can change the OSD setting, but that only covers file descriptors against the backing store, not sockets for network communication -- the latter is more often the one that runs out. -Greg On Thursday, June 12, 2014, Christian Kauhaus > wr
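
(The two knobs typically involved, with example values:)

    # ceph.conf -- applied by the init script when it starts the daemons
    [global]
        max open files = 131072

    # /etc/security/limits.conf -- for anything started outside the init script
    root    soft    nofile    131072
    root    hard    nofile    131072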

Re: [ceph-users] Fail to Block Devices and OpenStack

2014-06-12 Thread 山下 良民
Hi, Thanks for your information! I will check it soon, and will post results later. Thanks a lot and best regards, Yamashita = OSS Laboratories Inc. Yoshitami Yamashita Mail:yamash...@ossl.co.jp - Original Message - From: "Karan Singh" To: "山下 良民" Cc: ceph-users@list

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Mark Nelson
On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote: 1) I did check iostat on all OSDs, and iowait seems normal. 2) ceph -w shows no correlation between high io wait and high iops. Sometimes the reverse is true: when io wait is high (since it's a cluster wide thing), the overall ceph iops drops too.

Re: [ceph-users] Can we map OSDs from different hosts (servers) to a Pool in Ceph

2014-06-12 Thread Gregory Farnum
On Thu, Jun 12, 2014 at 2:21 AM, VELARTIS Philipp Dürhammer wrote: > Hi, > > Will ceph support mixing different disk pools (e.g. spinners and SSDs) a > little better (more safely) in the future? There are no immediate plans to do so, but this is an extension to the CRUSH language that we're
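
(Until then, the usual workaround is to keep SSDs and spinners under separate CRUSH roots and point a rule at each; the sketch below assumes an 'ssd' root already exists in the CRUSH map.)

    rule ssdpool {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
    }
    # then: ceph osd pool set <pool> crush_ruleset 4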

[ceph-users] error (24) Too many open files

2014-06-12 Thread Christian Kauhaus
Hi, we have a Ceph cluster with 32 OSDs running on 4 servers (8 OSDs per server, one for each disk). From time to time, I see Ceph servers running out of file descriptors. It logs lines like: > 2014-06-08 22:15:35.154759 7f850ac25700 0 filestore(/srv/ceph/osd/ceph-20) write couldn't open 86.37_

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Xu (Simon) Chen
1) I did check iostat on all OSDs, and iowait seems normal. 2) ceph -w shows no correlation between high io wait and high iops. Sometimes the reverse is true: when io wait is high (since it's a cluster wide thing), the overall ceph iops drops too. 3) We have collectd running in VMs, and that's how

Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread David
Hi Simon, Did you check iostat on the OSDs to check their utilization? What does your ceph -w say - perhaps you’re maxing out your cluster’s IOPS? Also, are you running any monitoring of your VMs’ iostats? We’ve often found culprits overusing IO. Kind Regards, David Majchrzak 12 jun 2014 kl.

[ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Xu (Simon) Chen
Hi folks, We have two similar ceph deployments, but one of them is having trouble: VMs running with ceph-provided block devices are seeing frequent high io wait, every few minutes, usually 15-20% but as high as 60-70%. This is cluster-wide and not correlated with the VMs' IO load. We turned on rbd

Re: [ceph-users] Backfilling, latency and priority

2014-06-12 Thread Andrey Korolyov
On Thu, Jun 12, 2014 at 5:02 PM, David wrote: > Thanks Mark! > > Well, our workload has more IOs and quite low throughput, perhaps 10MB/s -> > 100MB/s. It’s a quite mixed workload, but mostly small files (http / mail / > sql). > During the recovery we had ranged between 600-1000MB/s throughput.

Re: [ceph-users] Backfilling, latency and priority

2014-06-12 Thread David
Thanks Mark! Well, our workload has more IOs and quite low throughput, perhaps 10MB/s -> 100MB/s. It’s a quite mixed workload, but mostly small files (http / mail / sql). During the recovery we had ranged between 600-1000MB/s throughput. So the only way to currently ”fix” this is to have enough

Re: [ceph-users] OSDs

2014-06-12 Thread Mark Nelson
On 06/12/2014 07:27 AM, Christian Kauhaus wrote: On 12.06.2014 14:09, Loic Dachary wrote: With the replication factor set to three (which is the default), it can tolerate two OSDs failing at the same time. I've noticed that a replication factor of 3 is the new default in firefly. What rati

Re: [ceph-users] Backfilling, latency and priority

2014-06-12 Thread Mark Nelson
On 06/12/2014 03:44 AM, David wrote: Hi, We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs). We lost an OSD and the cluster started to backfill the data to the rest of the OSDs - during which the latency skyrocketed on some OSDs and connected clients experienced massive IO

Re: [ceph-users] OSDs

2014-06-12 Thread Christian Kauhaus
On 12.06.2014 14:09, Loic Dachary wrote: > With the replication factor set to three (which is the default), it can > tolerate two OSDs failing at the same time. I've noticed that a replication factor of 3 is the new default in firefly. What rationale led to changing the default? It used to be

Re: [ceph-users] OSDs

2014-06-12 Thread Loic Dachary
Hi, With the replication factor set to three (which is the default), it can tolerate two OSDs failing at the same time. Cheers On 12/06/2014 13:43, yalla.gnan.ku...@accenture.com wrote: > Hi All, > > Up to how many OSD failures can Ceph tolerate? > > Thanks > Kumar
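
(To check or change this per pool -- 'rbd' here is just the example pool name:)

    ceph osd pool get rbd size        # replicas kept
    ceph osd pool get rbd min_size    # replicas needed to keep serving I/O
    ceph osd pool set rbd size 3
    # cluster-wide default for new pools, in ceph.conf:
    #   [global]
    #   osd pool default size = 3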

[ceph-users] OSDs

2014-06-12 Thread yalla.gnan.kumar
Hi All, Up to how many OSD failures can Ceph tolerate? Thanks Kumar

[ceph-users] Add fourth monitor problem

2014-06-12 Thread Cao, Buddy
Hi, I added a fourth monitor to the cluster, but it's always down, even after I restart the mon service. I attached the logs below, could you help? [root@ceph1]# ceph health detail HEALTH_WARN 1 mons down, quorum 0,1,2 1,3,0; clock skew detected on mon.3, mon.0 mon.8 (rank 3) addr 192.168.1.4:6
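
(Things worth checking in that situation, not from the thread: clock sync on every monitor host and the quorum state. Note also that a fourth monitor does not buy extra failure tolerance -- both 3 and 4 monitors survive the loss of only one.)

    ntpq -p                                   # is NTP actually syncing on each mon host?
    ceph mon stat
    ceph quorum_status --format json-pretty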

Re: [ceph-users] Can we map OSDs from different hosts (servers) to a Pool in Ceph

2014-06-12 Thread VELARTIS Philipp Dürhammer
Hi, Will ceph support mixing different disk pools (e.g. spinners and SSDs) a little better (more safely) in the future? Thank you philipp On Wed, Jun 11, 2014 at 5:18 AM, Davide Fanciola wrote: > Hi, > > we have a similar setup where we have SSD and HDD in the same hosts. > Our very basic

[ceph-users] Backfilling, latency and priority

2014-06-12 Thread David
Hi, We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs). We lost an OSD and the cluster started to backfill the data to the rest of the OSDs - during which the latency skyrocketed on some OSDs and connected clients experienced massive IO wait. I’m trying to rectify the situa
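
(The usual mitigation is to throttle recovery/backfill; the values below are the most conservative common choices and should be adjusted per cluster.)

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
    # to make it permanent, in ceph.conf under [osd]:
    #   osd max backfills = 1
    #   osd recovery max active = 1
    #   osd recovery op priority = 1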

Re: [ceph-users] Striping

2014-06-12 Thread David
Hi, It depends on what you mean by a ”user”. You can set up pools with different replication / erasure coding etc: http://ceph.com/docs/master/rados/operations/pools/ Kind Regards, David Majchrzak On 12 Jun 2014, at 10:22, wrote: > Hi All, > > I have a ceph cluster. If a user wants just st
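
(For example -- pool and image names are placeholders:)

    # a replicated pool and an erasure-coded pool
    ceph osd pool create rep-pool 128 128 replicated
    ceph osd pool create ec-pool 128 128 erasure
    # an RBD image with custom striping (format 2 images are required for non-default striping)
    rbd create rep-pool/striped-img --size 10240 --image-format 2 --stripe-unit 65536 --stripe-count 8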

[ceph-users] Striping

2014-06-12 Thread yalla.gnan.kumar
Hi All, I have a ceph cluster. If a user wants just striped, distributed, or replicated storage, can we provide these types of storage exclusively? Thanks Kumar

[ceph-users] [ceph] OSD priority / client localization

2014-06-12 Thread NEVEU Stephane
Hi all, One short question that is quite important for me: is there a way to give some clients a higher OSD/host priority in one datacenter and do the opposite in another datacenter? I mean, my network links between those datacenters will be used in case of failover for clients accessing data on ce

Re: [ceph-users] ceph-deploy - problem creating an osd

2014-06-12 Thread Markus Goldberg
On 11.06.2014 16:47, Alfredo Deza wrote: On Wed, Jun 11, 2014 at 9:29 AM, Markus Goldberg wrote: Hi, ceph-deploy-1.5.3 can cause trouble if a reboot is done between preparation and activation of an osd: The osd-disk was /dev/sdb at this time, the osd itself should go to sdb1, formatted to cleare
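
(For reference, the prepare/activate pair is normally run back to back, without a reboot in between; the host and device names are examples.)

    ceph-deploy osd prepare node2:/dev/sdb
    ceph-deploy osd activate node2:/dev/sdb1
    # or both steps in one go
    ceph-deploy osd create node2:/dev/sdb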