Re: [ceph-users] capacity planning - iops

2016-09-19 Thread Jan Schermer
Are you talking about global IOPS or per-VM/per-RBD device? And at what queue depth? It all comes down to latency. Not sure what the numbers can be on recent versions of Ceph and on modern OSes but I doubt it will be <1ms for the OSD daemon alone. That gives you 1000 real synchronous IOPS. With h

Re: [ceph-users] BlueStore write amplification

2016-08-22 Thread Jan Schermer
Is that 400MB on all nodes or on each node? If it's on all nodes then 10:1 is not that surprising. What was the block size in your fio benchmark? We had much higher amplification on our cluster with snapshots and stuff... Jan > On 23 Aug 2016, at 08:38, Zhiyuan Wang wrote: > > Hi > I have te

Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Jan Schermer
I had to make a cronjob to trigger compact on the MONs as well. Ancient version, though. Jan > On 11 Aug 2016, at 10:09, Wido den Hollander wrote: > > >> Op 11 augustus 2016 om 9:56 schreef Eugen Block : >> >> >> Hi list, >> >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs
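A minimal sketch of such a cron job, assuming a monitor named mon.a and a weekly schedule (both are placeholders):
  # /etc/cron.d/ceph-mon-compact -- compact the monitor's leveldb store once a week
  # mon ID "a" and the schedule are examples; adjust per node
  0 3 * * 0  root  /usr/bin/ceph tell mon.a compact >> /var/log/ceph/mon-compact.log 2>&1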

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
000-92945 >> 243 NAND_Writes_32MiB -O--CK 100 100 000-95289 >> >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space >> seems to be holding steady. >> >> AFAIK, we've only had one other S3610 fail, and it seeme

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
I'm a fool, I miscalculated the writes by a factor of 1000 of course :-) 600GB/month is not much for S36xx at all, must be some sort of defect then... Jan > On 03 Aug 2016, at 12:15, Jan Schermer wrote: > > Make sure you are reading the right attribute and interpreting it r

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Make sure you are reading the right attribute and interpreting it right. update-smart-drivedb sometimes works wonders :) I wonder what the isdct tool would say the drive's life expectancy is with this workload? Are you really writing ~600TB/month?? Jan > On 03 Aug 2016, at 12:06, Maxime Guyot wro
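For reference, checking the write counters usually looks something like this (the device name is a placeholder and attribute names differ between models):
  # refresh smartmontools' drive database so vendor-specific attributes decode correctly
  update-smart-drivedb
  # dump the SMART attribute table; on Intel DC SSDs host writes are typically
  # reported in units of 32 MiB, so multiply the raw value accordingly
  smartctl -A /dev/sdX | grep -i -e write -e wear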

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jan Schermer
> On 20 Jul 2016, at 18:38, Mike Christie wrote: > > On 07/20/2016 03:50 AM, Frédéric Nass wrote: >> >> Hi Mike, >> >> Thanks for the update on the RHCS iSCSI target. >> >> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is >> it too early to say / announce). > > No HA

Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
my maintenance >> * start the physical node >> * reseat and activate the OSD drives one by one >> * unset the noout flag >> > > That should do it indeed. Take your time between the OSDs and that should > limit the 'downtime' for clients.

Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
If you stop the OSDs cleanly then that should cause no disruption to clients. Starting the OSD back up is another story, expect slow requests for a while there, and unless you have lots of very fast CPUs on the OSD node, start them one-by-one and not all at once. Jan > On 13 Jul 2016, at 14:37,
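A rough sketch of starting the OSDs one at a time (the OSD IDs, the pause, and the systemd unit name are assumptions; older releases use the sysvinit script instead):
  # bring the OSDs on this node up one by one, letting each settle before the next
  for id in 12 13 14 15; do              # example OSD IDs on this host
    systemctl start ceph-osd@"$id"       # or: service ceph start osd.$id
    sleep 60                             # give peering a chance to finish
  done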

Re: [ceph-users] ceph + vmware

2016-07-08 Thread Jan Schermer
There is no Ceph plugin for VMware (and I think you need at least an Enterprise license for storage plugins, much $$$). The "VMware" way to do this without the plugin would be to have a VM running on every host serving RBD devices over iSCSI to the other VMs (the way their storage appliances wo

Re: [ceph-users] Disk failures

2016-06-14 Thread Jan Schermer
Hi, bit rot is not "bit rot" per se - nothing is rotting on the drive platter. It occurs during reads (mostly, anyway), and it's random. You can happily read a block and get the correct data, then read it again and get garbage, then get correct data again. This could be caused by a worn out cell

Re: [ceph-users] Ubuntu Trusty: kernel 3.13 vs kernel 4.2

2016-06-14 Thread Jan Schermer
One storage setup has exhibited extremely poor performance in my lab on the 4.2 kernel (mdraid1+lvm+nfs); others run fine. No problems with xenial so far. If I had to choose an LTS kernel for trusty I'd choose the xenial one. (Btw I think the newest trusty point release already has the 4.2 HWE stack by

Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread Jan Schermer
I'd be worried about it getting "fast" all of a sudden. Test crash consistency. If you test something like file creation you should be able to estimate if it should be that fast. (So it should be some fraction of theoretical IOPS on the drives/backing rbd device...) If it's too fast then maybe the

Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread Jan Schermer
It should be noted that using "async" with NFS _will_ corrupt your data if anything happens. It's ok-ish for something like an image library, but it's most certainly not OK for VM drives, databases, or if you write any kind of binary blobs that you can't recreate. If ceph-fuse is fast (you are

Re: [ceph-users] "clock skew detected" wont go away.

2016-05-18 Thread Jan Schermer
Have you tried restarting the mons? Did you change the timezone or run hwclock or something like that during their lifetime? And if you're running them in containers, are you providing them with /etc/adjtime and such? Jan > On 19 May 2016, at 07:29, Stefan Eriksson wrote: > > I'm using hammer and

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-13 Thread Jan Schermer
>> My systemd-foo isn't that good either, so I don't know what is happening >> here. >> >> Wido >> >>> Op 12 mei 2016 om 15:31 schreef Jan Schermer : >>> >>> >>> Btw try replacing >>> >>> WantedBy=ceph-mon.t

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
ote: > > > To also answer Sage's question: No, this is a fresh Jewel install in a few > test VMs. This system was not upgraded. > > It was installed 2 hours ago. > >> Op 12 mei 2016 om 14:51 schreef Jan Schermer : >> >> >> Can you post the c

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
So systemctl is-enabled ceph-mon.target says "enabled" as well? I think it should start then, or at least try Jan > On 12 May 2016, at 15:14, Wido den Hollander wrote: > > >> Op 12 mei 2016 om 15:12 schreef Jan Schermer : >> >> >> What

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
o. > >> Op 12 mei 2016 om 14:51 schreef Jan Schermer : >> >> >> Can you post the contents of ceph-mon@.service file? >> > > Yes, here you go: > > root@charlie:~# cat /lib/systemd/system/ceph-mon@.service > [Unit] > Description=Ceph clus

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
Can you post the contents of ceph-mon@.service file? what does systemctl is-enabled ceph-mon@charlie say? However, this looks like it was just started at a bad moment and died - nothing in logs? Jan > On 12 May 2016, at 14:44, Sage Weil wrote: > > On Thu, 12 May 2016, Wido den Hollander wro

Re: [ceph-users] Ubuntu or CentOS for my first lab. Please recommend. Thanks

2016-05-05 Thread Jan Schermer
This is always a topic that starts a flamewar. My POV: Ubuntu + generally newer versions of software, packages are closer to vanilla versions + more community packages + several versions of HWE (kernels) to choose from over the lifetime of the distro -

Re: [ceph-users] On-going Bluestore Performance Testing Results

2016-04-22 Thread Jan Schermer
Having correlated graphs of CPU and block device usage would be helpful. To my cynical eye this looks like a clear regression in CPU usage, which was always bottlenecking pure-SSD OSDs, and now got worse. The gains are from doing less IO on IO-saturated HDDs. Regression of 70% in 16-32K random w

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
; cluster that we are deploying has several hardware choices which go a long > way to improve this performance as well. Coupled with the coming Bluestore, > the future looks bright. > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
't happen (if it did the contents would be the same). You say "Ceph needs", but I say "the guest VM needs" - there's the problem. > On 12 Apr 2016, at 21:58, Sage Weil wrote: > > Okay, I'll bite. > > On Tue, 12 Apr 2016, Jan Schermer wrote: >

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
> On 12 Apr 2016, at 20:00, Sage Weil wrote: > > On Tue, 12 Apr 2016, Jan Schermer wrote: >> I'd like to raise these points, then >> >> 1) some people (like me) will never ever use XFS if they have a choice >> given no choice, we will not use something t

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
I'd like to raise these points, then 1) some people (like me) will never ever use XFS if they have a choice; given no choice, we will not use something that depends on XFS 2) choice is always good 3) doesn't the majority of Ceph users only care about RBD? (Angry rant coming) Even our last performanc

Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Jan Schermer
RIP Ceph. > On 11 Apr 2016, at 23:42, Allen Samuels wrote: > > RIP ext4. > > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 > allen.samu...@sandisk.com > > >> -Original

Re: [ceph-users] Ceph-fuse huge performance gap between different block sizes

2016-03-25 Thread Jan Schermer
FYI when I performed testing on our cluster I saw the same thing. fio randwrite 4k test over a large volume was a lot faster with larger RBD object size (8mb was marginally better than the default 4mb). It makes no sense to me unless there is a huge overhead with increasing number of objects. Or

Re: [ceph-users] xfs: v4 or v5?

2016-03-25 Thread Jan Schermer
V5 is supposedly stable, but that only means it will be just as bad as any other XFS. I recommend avoiding XFS whenever possible. Ext4 works perfectly and I never lost any data with it, even when it got corrupted, while XFS still likes to eat the data when something goes wrong (and it will, lik

Re: [ceph-users] DONTNEED fadvise flag

2016-03-23 Thread Jan Schermer
So the OSDs pass this through to the filestore so it doesn't pollute the cache? That would be... surprising. Jan > On 23 Mar 2016, at 18:28, Gregory Farnum wrote: > > On Mon, Mar 21, 2016 at 6:02 AM, Yan, Zheng wrote: >> >>> On Mar 21, 2016, at 18:17, Kenneth Waegeman >>> wrote: >>> >>> T

Re: [ceph-users] Performance with encrypted OSDs

2016-03-20 Thread Jan Schermer
Compared to ceph-osd overhead, the dm-crypt overhead will be completely negligible for most scenarios. One exception could be sequential reads with a slow CPU (not supporting AES), but I don't expect more than a few percent difference even then. Btw a nicer solution is to use SED drives. Spindles ar

Re: [ceph-users] ssd only storage and ceph

2016-03-19 Thread Jan Schermer
> On 17 Mar 2016, at 17:28, Erik Schwalbe wrote: > > Hi, > > at the moment I do some tests with SSD's and ceph. > My Question is, how to mount an SSD OSD? With or without discard option? I recommend running without discard but running "fstrim" command every now and then (depends on how fast
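A minimal sketch of such a periodic trim, assuming the usual OSD mount points (paths and schedule are placeholders):
  #!/bin/sh
  # /etc/cron.weekly/fstrim-osds -- batch-discard free space instead of mounting with discard
  for mnt in /var/lib/ceph/osd/ceph-*; do
    fstrim "$mnt"    # one pass per OSD filesystem; run off-peak, it can cause a short latency blip
  done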

Re: [ceph-users] xfs corruption

2016-03-07 Thread Jan Schermer
This functionality is common on RAID controllers in combination with HCL-certified drives. This usually means that you can't rely on it working unless you stick to the exact combination that's certified, which is impossible in practice. For example LSI controllers do this if you get the right SS

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Jan Schermer
I think the latency comes from journal flushing. Try tuning filestore min sync interval = .1 and filestore max sync interval = 5, and also /proc/sys/vm/dirty_bytes (I suggest 512MB) and /proc/sys/vm/dirty_background_bytes (I suggest 256MB). See if that helps. It would be useful to see the job you are runn
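Put together, the suggested settings would look roughly like this (the values are the ones above; tune them to your hardware):
  # ceph.conf -- flush the filestore more often, in smaller batches
  [osd]
  filestore min sync interval = 0.1
  filestore max sync interval = 5

  # sysctl -- cap dirty page cache so flushes stay small
  vm.dirty_bytes = 536870912              # 512 MB
  vm.dirty_background_bytes = 268435456   # 256 MB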

Re: [ceph-users] blocked i/o on rbd device

2016-03-02 Thread Jan Schermer
Are you exporting (or mounting) the NFS as async or sync? How much memory does the server have? Jan > On 02 Mar 2016, at 12:54, Shinobu Kinjo wrote: > > Ilya, > >> We've recently fixed two major long-standing bugs in this area. > > If you could elaborate more, it would be reasonable for the

Re: [ceph-users] Old CEPH (0.87) cluster degradation - putting OSDs down one by one

2016-02-27 Thread Jan Schermer
Anything in dmesg/kern.log at the time this happened? 0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) ** I think your filesystem was somehow corrupted. And regarding this: 2. Physical HDD replaced and NOT added to CEPH - here we had strange kernel crash just afte

Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Jan Schermer
The RBD backend might be even worse, depending on how large a dataset you try. One 4KB block can end up creating a 4MB object, and depending on how well hole-punching and fallocate work on your system you could in theory end up with a >1000 amplification if you always hit a different 4MB chunk (but t

Re: [ceph-users] Guest sync write iops so poor.

2016-02-26 Thread Jan Schermer
Also take a look at Galera cluster. You can relax flushing to disk as long as all your nodes don't go down at the same time. (And when a node goes back up after a crash you should trash it before it rejoins the cluster) Jan > On 26 Feb 2016, at 11:01, Nick Fisk wrote: > > I guess my question

Re: [ceph-users] Guest sync write iops so poor.

2016-02-26 Thread Jan Schermer
O_DIRECT is _not_ a flag for synchronous blocking IO. O_DIRECT only hints to the kernel that it need not cache/buffer the data. The kernel is actually free to buffer and cache it, and it does buffer it. It also does _not_ flush O_DIRECT writes to disk, but it makes a best effort to send it to the drives
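The difference is easy to demonstrate with fio: direct=1 only bypasses the page cache, while sync=1 forces each write to stable storage. A sketch (device and job parameters are examples; these write to the device, so use a scratch volume):
  # O_DIRECT only: no flush to media is guaranteed, so IOPS look high
  fio --name=direct --filename=/dev/rbd0 --rw=randwrite --bs=4k --iodepth=1 --direct=1 --runtime=60 --time_based
  # O_DIRECT + O_SYNC: every write must be durable, so this shows the real commit latency
  fio --name=sync --filename=/dev/rbd0 --rw=randwrite --bs=4k --iodepth=1 --direct=1 --sync=1 --runtime=60 --time_based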

Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer
LSI sucks, so who cares :-) * Sorry if I mixed some layers, maybe it's not filesystem that calls discard but another layer in kernel, also not sure how exactly discard_* values are detected and when etc., but in essence it works like that. Jan > Rgds, > Shinobu > > -

Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer
We are very happy with S3610s in our cluster. We had to flash a new firmware because of latency spikes (NCQ-related), but had zero problems after that... Just beware of HBA compatibility, even in passthrough mode some crappy firmwares can try and be smart about what you can do (LSI-Avago, I'm loo

Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Jan Schermer
> On 25 Feb 2016, at 14:39, Nick Fisk wrote: > > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Huan Zhang >> Sent: 25 February 2016 11:11 >> To: josh.dur...@inktank.com >> Cc: ceph-users >> Subject: [ceph-users] Guest sync write

Re: [ceph-users] cephfs mmap performance?

2016-02-19 Thread Jan Schermer
I don't think there's any point in MMAP-ing a virtual file. And I'd be surprised if there weren't any bugs or performance issues... Jan > On 19 Feb 2016, at 14:38, Dzianis Kahanovich wrote: > > I have content for apache 2.4 in cephfs, trying to be scalable, "EnableMMAP > On". > Some environmen

Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Jan Schermer
It would be helpful to see your crush map (there are some tunables that help with this issue as well available if you're not running ancient versions). However, distribution uniformity isn't that great really. It helps to increase the number of PGs, but beware that there's no turning back. Other

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
d > perhaps that's what is still filling the drive. With this info - do you think > I'm still safe to follow the steps suggested in previous post? > > Thanks! > > Lukas > > On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer <mailto:j...@schermer.cz>> wr

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
will the cluster attempt to move some pgs > out of these drives to different local OSDs? I'm asking because when I've > attempted to delete pg dirs and restart OSD for the first time, the OSD get > full again very fast. > > Thank you. > > Lukas > > > >

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Ahoj ;-) You can reweight them temporarily; that shifts the data off the full drives. ceph osd reweight osd.XX YY (XX = the number of the full OSD, YY is the "weight", which defaults to 1) This is different from "crush reweight", which defaults to the drive size in TB. Beware that reweighting will (afaik) on
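For example, to shift data off a nearly full osd.12 (the OSD number and weight are placeholders):
  ceph osd reweight 12 0.85    # temporary weight for osd.12; 1 = normal, lower = less data
  ceph osd df                  # watch per-OSD utilisation as backfilling proceeds (Hammer and later)
  ceph -s                      # wait for backfill/recovery to finish before reweighting further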

Re: [ceph-users] Dell Ceph Hardware recommendations

2016-02-10 Thread Jan Schermer
Dell finally sells a controller with true JBOD mode? The last I checked they only had "JBOD-via-RAID0" as a recommended solution (doh, numerous problems) and true JBOD was only offered for special use cases like hadoop storage. One can obviously reflash the controller to another mode, but that's

Re: [ceph-users] SSD Journal

2016-02-01 Thread Jan Schermer
ournal is "never" read from, only when the OSD process crashes. > > I'm happy to be corrected if I've misstated anything. > > Robert LeBlanc > > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >

Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer
> On 29 Jan 2016, at 16:00, Lionel Bouton > wrote: > > Le 29/01/2016 01:12, Jan Schermer a écrit : >> [...] >>> Second I'm not familiar with Ceph internals but OSDs must make sure that >>> their PGs are synced so I was under the impression that the OS

Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer
>> inline > On 29 Jan 2016, at 05:03, Somnath Roy wrote: > > < > From: Jan Schermer [mailto:j...@schermer.cz <mailto:j...@schermer.cz>] > Sent: Thursday, January 28, 2016 3:51 PM > To: Somnath Roy > Cc: Tyler Bishop; ceph-users@lists.ceph.com <mailto:c

Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
> On 28 Jan 2016, at 23:19, Lionel Bouton > wrote: > > Le 28/01/2016 22:32, Jan Schermer a écrit : >> P.S. I feel very strongly that this whole concept is broken fundamentaly. We >> already have a journal for the filesystem which is time proven, well behaved >&g

Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
ond of the multi-ms committing limbo while data falls down through those dream layers :P I really don't know how to explain that more. I bet if you ask on LKML, someone like Theodore Ts'o would say "you're doing completely superfluous work" in more technical terms. Jan

Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
You can't run Ceph OSD without a journal. The journal is always there. If you don't have a journal partition then there's a "journal" file on the OSD filesystem that does the same thing. If it's a partition then this file turns into a symlink. You will always be better off with a journal on a se

Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
r best solution now. And we > want to reduce the used tools count so we can solve this with Ceph it's good. > > Thanks for your help and I will check these. > > 2016-01-28 13:13 GMT+01:00 Jan Schermer <mailto:j...@schermer.cz>>: > Yum repo doesn't r

Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
CephFs, it will be the best solution for us, but it > is in beta and beta products cant allowed for us. This is why I tryto find > another, Ceph based solution. > > > > 2016-01-28 12:46 GMT+01:00 Jan Schermer <mailto:j...@schermer.cz>>: > This is somewhat

Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
This is somewhat confusing. CephFS is a shared filesystem - you mount that on N hosts and they can access the data simultaneously. RBD is a block device; this block device can be accessed from more than 1 host, BUT you need to use a cluster-aware filesystem (such as GFS2, OCFS). Both CephFS and

Re: [ceph-users] OSD behavior, in case of its journal disk (either HDD or SSD) failure

2016-01-25 Thread Jan Schermer
OSD stops. And you pretty much lose all data on the OSD if you lose the journal. Jan > On 25 Jan 2016, at 14:04, M Ranga Swami Reddy wrote: > > Hello, > > If a journal disk fails (with crash or power failure, etc), what > happens on OSD operations? > > PS: Assume that journal and OSD is on a

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
3622 1 > UST ID: DE274086107 > > > Am 20.01.2016 um 14:14 schrieb Wade Holler: >> Great commentary. >> >> While it is fundamentally true that higher clock speed equals lower >> latency, I'm my practical experience we are more often interested in >>

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
he way not only depend > on the frequency ) is sufficient that its not breaking the HDD data output. > > > > -- > Mit freundlichen Gruessen / Best regards > > Oliver Dzombic > IP-Interactive > > mailto:i...@ip-interactive.de > > Anschrift: > > I

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
This is very true, but do you actually exclusively pin the cores to the OSD daemons so they don't interfere? I don't think many people do that; it wouldn't work with more than a handful of OSDs. The OSD might typically only need <100% of one core, but during startup or some reshuffling it's benefi

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
The OSD is able to use more than one core to do the work, so increasing the number of cores will increase throughput. However, if you care about latency then that is always tied to speed=frequency. If the question was "should I get 40GHz in 8 cores or in 16 cores" then the answer will always be

Re: [ceph-users] bad sectors on rbd device?

2016-01-06 Thread Jan Schermer
I think you are running out of memory(?), or at least of the memory for the type of allocation krbd tries to use. I'm not going to decode all the logs, but you can try increasing min_free_kbytes as the first step. I assume this is amd64, where there's no HIGHMEM trouble (I don't remember how to sol

Re: [ceph-users] Random Write Fio Test Delay

2015-12-31 Thread Jan Schermer
Is it only on the first run or on every run? Fio first creates the file and that can take a while depending on how fallocate() works on your system. In other words you are probably waiting for a 1G file to be written before the test actually starts. Jan > On 31 Dec 2015, at 04:49, Sam Huracan
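One way to keep the file layout out of the measurement is to lay the file out before the timed run (size and path are examples):
  # pre-create the 1G test file so the first fio run doesn't spend its time extending it
  dd if=/dev/zero of=/mnt/test/fio.dat bs=1M count=1024 oflag=direct
  # the timed run now starts immediately on an already-allocated file
  fio --name=randwrite --filename=/mnt/test/fio.dat --size=1G --rw=randwrite --bs=4k --direct=1 --runtime=60 --time_based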

Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Jan Schermer
with ceph-deploy it >> should be added in the conf file automatically..How are you creating your >> cluster ? >> Did you change conf file after installing ? >> >> From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz >> <mailto:samek...@fel.cvut.cz>]

Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Jan Schermer
Has the cluster ever worked? Are you sure that "mon initial members = 0" is correct? How do the OSDs know where to look for MONs? Jan > On 29 Dec 2015, at 21:41, Ing. Martin Samek > wrote: > > Hi, > > network is OK, all nodes are in one VLAN, in one switch, in o

Re: [ceph-users] ubuntu 14.04 or centos 7

2015-12-29 Thread Jan Schermer
If you need to get Ceph up&running as part of some enterprise project, where you can't touch it after it gets in production without PM approval and a Change Request then CentOS is what fits in that box. You are unlikely to break anything during upgrades and you get new hardware support and (not

Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Jan Schermer
Even with 10G ethernet, the bottleneck is not the network, nor the drives (assuming they are datacenter-class). The bottleneck is the software. The only way to improve that is to either increase CPU speed (more GHz per core) or to simplify the datapath IO has to take before it is considered dura

Re: [ceph-users] write speed , leave a little to be desired?

2015-12-11 Thread Jan Schermer
The drive will actually be writing 500MB/s in this case, if the journal is on the same drive. All writes get to the journal and then to the filestore, so 200MB/s is actually a sane figure. Jan > On 11 Dec 2015, at 13:55, Zoltan Arnold Nagy > wrote: > > It’s very unfortunate that you guys ar

Re: [ceph-users] [Ceph] Feature Ceph Geo-replication

2015-12-10 Thread Jan Schermer
If you don't need synchronous replication then asynchronous is the way to go, but Ceph doesn't offer that natively. (not for RBD anyway, not sure how radosgw could be set up). 200km will add at least 1ms of latency network-wise, 2ms RTT, for TCP it will be more. For sync replication (which ce

Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer
> On 10 Dec 2015, at 15:14, Sage Weil wrote: > > On Thu, 10 Dec 2015, Jan Schermer wrote: >> Removing snapshot means looking for every *potential* object the snapshot >> can have, and this takes a very long time (6TB snapshot will consist of 1.5M >> objects (i

Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer
Removing snapshot means looking for every *potential* object the snapshot can have, and this takes a very long time (6TB snapshot will consist of 1.5M objects (in one replica) assuming the default 4MB object size). The same applies to large thin volumes (don't try creating and then dropping a 1

Re: [ceph-users] Blocked requests after "osd in"

2015-12-10 Thread Jan Schermer
Just try to give the booting OSD and all MONs the resources they ask for (CPU, memory). Yes, it causes disruption but only for a select group of clients, and only for a moment (<20s with my extremely high number of PGs). From a service provider perspective this might break SLAs, but until you get

Re: [ceph-users] CephFS: number of PGs for metadata pool

2015-12-09 Thread Jan Schermer
Number of PGs doesn't affect the number of replicas, so don't worry about it. Jan > On 09 Dec 2015, at 13:03, Mykola Dvornik wrote: > > Hi guys, > > I am creating a 4-node/16OSD/32TB CephFS from scratch. > > According to the ceph documentation the metadata pool should have small > amount of

Re: [ceph-users] Blocked requests after "osd in"

2015-12-09 Thread Jan Schermer
Are you seeing "peering" PGs when the blocked requests are happening? That's what we see regularly when starting OSDs. I'm not sure this can be solved completely (and whether there are major improvements in newer Ceph versions), but it can be sped up by 1) making sure you have free (and not dirt

Re: [ceph-users] ceph snapshost

2015-12-08 Thread Jan Schermer
You don't really *have* to stop I/O. In fact, I recommend you don't unless you have to. The reason why this is recommended is to minimize the risk of data loss because the snapshot will be in a very similar state as if you suddenly lost power to the server. Obviously if you need to have the sam
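If you do want a filesystem-consistent snapshot, a short freeze around the snap is usually enough; a sketch (mountpoint and pool/image names are placeholders):
  fsfreeze -f /mnt/data                        # quiesce the filesystem; writes block here
  rbd snap create rbd/myimage@before-upgrade   # take the snapshot while the fs is clean
  fsfreeze -u /mnt/data                        # thaw; the pause is typically only a second or two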

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Jan Schermer
un at any given time but can't (because they are waiting for whatever they need - disks, CPU, blocking sockets...). Jan > > Thx > > Benedikt > > > 2015-12-08 8:44 GMT+01:00 Jan Schermer : >> >> Jan >> >> >>> On 08 Dec 2015,

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
Doesn't look near the limit currently (but I suppose you rebooted it in the meantime?). Did iostat say anything about the drives? (btw dm-1 and dm-6 are what? Is that your data drives?) - were they overloaded really? Jan > On 08 Dec 2015, at 08:41, Benedikt Fraunhofer wrote: > > Hi Jan, > >

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
you provided. > > Thx! > > Benedikt > > 2015-12-08 8:15 GMT+01:00 Jan Schermer : >> What is the setting of sysctl kernel.pid_max? >> You relly need to have this: >> kernel.pid_max = 4194304 >> (I think it also sets this as well: kernel.threads-max = 419430

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
What is the setting of sysctl kernel.pid_max? You really need to have this: kernel.pid_max = 4194304 (I think it also sets this as well: kernel.threads-max = 4194304). I think you are running out of process IDs. Jan > On 08 Dec 2015, at 08:10, Benedikt Fraunhofer wrote: > > Hello Cephers, > >
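To make that persistent across reboots (the file name is an example):
  # /etc/sysctl.d/90-ceph-pids.conf -- OSD nodes spawn a very large number of threads
  kernel.pid_max = 4194304
  kernel.threads-max = 4194304

  # apply without rebooting and verify
  sysctl -p /etc/sysctl.d/90-ceph-pids.conf
  sysctl kernel.pid_max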

Re: [ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Jan Schermer
The rule of thumb is that the data on OSD is gone if the related journal is gone. Journal doesn't just "vanish", though, so you should investigate further... This log is from the new empty journal, right? Jan > On 08 Dec 2015, at 08:08, Benedikt Fraunhofer wrote: > > Hello List, > > after so

Re: [ceph-users] New cluster performance analysis

2015-12-04 Thread Jan Schermer
> On 04 Dec 2015, at 14:31, Adrien Gillard wrote: > > After some more tests : > > - The pool being used as cache pool has no impact on performance, I get the > same results with a "dedicated" replicated pool. > - You are right Jan, on raw devices I get better performance on a volume if > I

Re: [ceph-users] ceph-osd@.service does not mount OSD data disk

2015-12-03 Thread Jan Schermer
echo add >/sys/block/sdX/sdXY/uevent The easiest way to make it mount automagically Jan > On 03 Dec 2015, at 20:31, Timofey Titovets wrote: > > Lol, it's opensource guys > https://github.com/ceph/ceph/tree/master/systemd > ceph-disk@ > > 2015-12-03 21:59 GMT+03:00 Florent B : >> "ceph" servic

Re: [ceph-users] How long will the logs be kept?

2015-12-03 Thread Jan Schermer
You can setup logrotate however you want - not sure what the default is for your distro. Usually logrotate doesn't touch files that are smaller than some size even if they are old. It will also not delete logs for OSDs that no longer exist. Ceph itself has nothing to do with log rotation, logro

Re: [ceph-users] New cluster performance analysis

2015-12-02 Thread Jan Schermer
> Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS > (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which > is far from rados bench (538) and fio (847). And surprisingly fio numbers are > greater than rados. > I think the missing factor here i

Re: [ceph-users] how to mount a bootable VM image file?

2015-12-02 Thread Jan Schermer
There's a pretty cool thing called libguestfs, and a tool called guestfish (http://libguestfs.org). I've never used it (just stumbled on it recently) but it should do exactly what you need :-) And it supports RBD. Jan > On 02 Dec 2015, at 18:07, Gregory Farnum wrote: >

Re: [ceph-users] Removing OSD - double rebalance?

2015-12-02 Thread Jan Schermer
1) if you have the original drive that works and just want to replace it then you can just "dd" it over to the new drive and then extend the partition if the new one is larger, this avoids double backfilling in this case 2) if the old drive is dead you should "out" it and at the same time add a n

Re: [ceph-users] Modification Time of RBD Images

2015-11-26 Thread Jan Schermer
Find in which block the filesystem on your RBD image stores its journal, find the object hosting this block in rados, and use its mtime :-) Jan > On 26 Nov 2015, at 18:49, Gregory Farnum wrote: > I don't think anything tracks this explicitly for RBD, but each RADOS object > does maintain an mti
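A rough way to do that from the command line (pool, image name, and object number are placeholders, and this assumes a format-2 image whose objects are prefixed rbd_data.*):
  rbd info rbd/myimage                                  # note the block_name_prefix, e.g. rbd_data.1234abcd
  rados -p rbd ls | grep rbd_data.1234abcd              # list the image's objects (slow on big pools)
  rados -p rbd stat rbd_data.1234abcd.0000000000000000  # prints the object's size and mtime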

Re: [ceph-users] Vierified and tested SAS/SATA SSD for Ceph

2015-11-24 Thread Jan Schermer
Intel DC series (S3610 for journals, S3510 might be OK for data). Samsung DC PRO series (if you can get them). There are other drives that might be suitable but I strongly suggest you avoid those that aren't tested by others - it's a PITA to deal with the problems poor SSDs cause. Jan > On 24

Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Jan Schermer
So I assume we _are_ talking about bit-rot? > On 23 Nov 2015, at 18:37, Jose Tavares wrote: > > Yes, but with SW-RAID, when we have a block that was read and does not > match its checksum, the device falls out of the array, and the data is read > again from the other devices in the array. That'

Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Jan Schermer
SW-RAID doesn't help with bit-rot if that's what you're afraid of. If you are afraid bit-rot you need to use a fully checksumming filesystem like ZFS. Ceph doesn't help there either when using replicas - not sure how strong error detection+correction is in EC-type pools. The only thing I can sug

Re: [ceph-users] what's the benefit if I deploy more ceph-mon node?

2015-11-19 Thread Jan Schermer
There's no added benefit - it just adds resiliency. On the other hand - more monitors means more likelihood that one of them will break, when that happens there will be a brief interruption to some (not only management) operations. If you decide to reduce the number of MONs then that is a PITA a

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
Apologies, it seems that to shrink the device a parameter --allow-shrink must be used. > On 12 Nov 2015, at 22:49, Jan Schermer wrote: > > xfs_growfs "autodetects" the block device size. You can force re-read of the > block device to refresh this info but might
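For the record, a grow/shrink would look roughly like this (image name and sizes are examples; shrink the filesystem first or anything past the new end is lost):
  # grow: enlarge the image, then the filesystem (XFS can only grow, never shrink)
  rbd resize --size 20480 rbd/myimage     # new size in MB
  xfs_growfs -d /mnt/myimage
  # shrink (non-XFS filesystems only, after shrinking the fs): needs the explicit safety flag
  rbd resize --size 10240 --allow-shrink rbd/myimage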

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
n how to prevent this type of issues, in the > future? Should the resizing and the xfs_growfs be executed with some > parameters, for a better configuration of the image and / or filesystem? > > Thank you very much for your help! > > Regards, > Bogdan > > > On

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
nning rbd resize <http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/> > and then 'xfs_growfs -d' on the filesystem. > > Is there a better way to resize an RBD image and the filesystem? > > On Thu, Nov 12, 2015 at 10:35 PM, Jan Schermer <mailto:j...@schermer

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
testing data. If I won't > succeed repairing it I will trash and re-create it, of course. > > Thank you, once again! > > > > On Thu, Nov 12, 2015 at 9:28 PM, Jan Schermer <mailto:j...@schermer.cz>> wrote: > How did you create filesystems and/or partitions

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
How did you create filesystems and/or partitions on this RBD block device? The obvious causes would be 1) you partitioned it and the partition on which you ran mkfs points or pointed during mkfs outside the block device size (happens if you for example automate this and confuse sectors x cylinder

Re: [ceph-users] Chown in Parallel

2015-11-10 Thread Jan Schermer
[quoted iostat -x output, columns mangled in the archive: two data drives, including sdl, running at roughly 99-100% utilization] > > From: ceph-users [mailto:ceph-users-boun.

Re: [ceph-users] Chown in Parallel

2015-11-10 Thread Jan Schermer
I would just disable barriers and enable them afterwards (+sync); should be a breeze then. Jan > On 10 Nov 2015, at 12:58, Nick Fisk wrote: > > I’m currently upgrading to Infernalis and the chown stage is taking a long > time on my OSD nodes. I’ve come up with this little one liner to run the
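For comparison, a parallel chown along those lines might look like this (paths and the level of parallelism are assumptions):
  # chown the PG directories in parallel, one worker per top-level directory
  find /var/lib/ceph/osd/ceph-*/current -mindepth 1 -maxdepth 1 -print0 \
    | xargs -0 -P 8 -n 1 chown -R ceph:ceph   # -P 8 = eight concurrent workers
  chown -R ceph:ceph /var/lib/ceph            # catch everything outside the PG dirs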

Re: [ceph-users] data size less than 4 mb

2015-11-02 Thread Jan Schermer
> On 02 Nov 2015, at 11:59, Wido den Hollander wrote: > > > > On 02-11-15 11:56, Jan Schermer wrote: >> Can those hints be disabled somehow? I was battling XFS preallocation >> the other day, and the mount option didn't make any difference - maybe >> beca
