Hi Jan, I can answer your question very quickly:
We need that! We need and want a stable, self-healing, scalable, robust, reliable storage system which can talk to our infrastructure in different languages. I fully understand that people whose infrastructure is about to lose support from a piece of software are not amused. I don't understand your insistence on looking at this matter from only your own point of view.

If you just think about it for a moment, you will remind yourself that this software is not designed for a single purpose. It is designed for multiple purposes, where "purpose" means the different flavours/ways in which different people try to use the software. I am very thankful when software designers try to make their product better and better. If that means they have to drop support for a filesystem type, then so be it. You will not die from that, and neither will anyone else.

I am waiting for the upcoming Jewel release to build a new cluster and migrate the old Hammer cluster into it. Jewel will have a new feature that allows migrating clusters. So what is your problem? For now I don't see any drawback for you. If the software can serve your RBD VMs, then you should not care whether it sits on ext2, 3, 4, 200, XFS or $what_ever_new. As long as it works, and maybe even provides more features than before, what is the problem? That YOU don't need those features? That you don't want your running system to be changed? That you are not the only Ceph user and the software is not privately developed for your needs? Seriously?

So let me welcome you to this world, where you are not alone, and where there are other people who also have wishes and wants. I am sure that the people who so badly need/want ext4 support are in the minority. Otherwise the Ceph developers would not drop it, because they are not stupid enough to drop a feature that is wanted/needed by a majority of people.

So please try to open your eyes a bit to the rest of the Ceph users. And if you manage that, try to open your eyes to the Ceph developers who built a product here that enables you to manage your stuff and whatever else you use Ceph for. And if all of that is not OK/right from your side, then become a Ceph developer and code contributor. Keep ext4 support alive and try to convince the other developers to maintain a feature which is technically not needed, technically in the way of better software design, and used by a minority of users. Good luck with that!

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de

Address:
IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, district court of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 12.04.2016 at 22:33, Jan Schermer wrote:
> Still the answer to most of your points from me is "but who needs that?"
> Who needs to have exactly the same data in two separate objects (replicas)?
> Ceph needs it because of "consistency", but the app (the VM filesystem) is fine
> with whatever version, because the flush didn't happen (if it had, the contents
> would be the same).
>
> You say "Ceph needs", but I say "the guest VM needs" - there's the problem.
>
>> On 12 Apr 2016, at 21:58, Sage Weil <s...@newdream.net> wrote:
>>
>> Okay, I'll bite.
>>
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>>> Local kernel file systems maintain their own internal consistency, but
>>>> they only provide the consistency promises that the POSIX interface
>>>> does--which is almost nothing.
>>>
>>> ... which is exactly what everyone expects
>>> ... which is everything any app needs
>>>
>>>> That's why every complicated data
>>>> structure (e.g., database) stored on a file system ever includes its own
>>>> journal.
>>> ... see?
>>
>> They do this because POSIX doesn't give them what they want. They
>> implement a *second* journal on top. The result is that you get the
>> overhead from both--the fs journal keeping its data structures consistent,
>> and the database keeping its own consistent. If you're not careful, that means
>> the db has to do something like file write, fsync, db journal append,
>> fsync.
> It's more like:
> transaction log write, flush
> data write
> That's simply because most filesystems don't journal data, but some do.
>
>
>> And both fsyncs turn into a *fs* journal io and flush. (Smart
>> databases often avoid most of the fs overhead by putting everything in a
>> single large file, but at that point the file system isn't actually doing
>> anything except passing IO to the block layer).
>>
>> There is nothing wrong with POSIX file systems. They have the unenviable
>> task of catering to a huge variety of workloads and applications, but are
>> truly optimal for very few. And that's fine. If you want a local file
>> system, you should use ext4 or XFS, not Ceph.
>>
>> But it turns out ceph-osd isn't a generic application--it has a pretty
>> specific workload pattern, and POSIX doesn't give us the interfaces we
>> want (mainly, atomic transactions or ordered object/file enumeration).
>
> The workload (with RBD) is inevitably expecting POSIX. Who needs more than
> that? To me that indicates unnecessary guarantees.
>
>>
>>>> We could "wing it" and hope for
>>>> the best, then do an expensive crawl and rsync of data on recovery, but we
>>>> chose very early on not to do that. If you want a system that "just"
>>>> layers over an existing filesystem, you can try Gluster (although note
>>>> that they have a different sort of pain with the ordering of xattr
>>>> updates, and are moving toward a model that looks more like Ceph's backend
>>>> in their next version).
>>>
>>> True, which is why we dismissed it.
>>
>> ...and yet it does exactly what you asked for:
>
> I was implying it suffers the same flaws. In any case it wasn't really fast
> and it seemed overly complex.
> To be fair, it was some while ago when I tried it.
> Can't talk about consistency - I don't think I ever used it in production as
> more than a PoC.
>
>>
>>>>> IMO, if Ceph was moving in the right direction [...] Ceph would
>>>>> simply distribute our IO around with CRUSH.
>>
>> You want Ceph to "just use a file system." That's what Gluster does--it
>> just layers the distributed namespace right on top of a local namespace.
>> If you didn't care about correctness or data safety, it would be
>> beautiful, and just as fast as the local file system (modulo network).
>> But if you want your data safe, you immediately realize that local POSIX
>> file systems don't give you what you need: the atomic update of two files
>> on different servers so that you can keep your replicas in sync. Gluster
>> originally took the minimal path to accomplish this: a "simple"
>> prepare/write/commit, using xattrs as transaction markers. We took a
>> heavyweight approach to support arbitrary transactions.
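To make the double-journal point above concrete ("file write, fsync, db journal append, fsync"), here is a minimal sketch, in C++ with plain POSIX calls, of the kind of write path a database ends up issuing on a journaling filesystem (shown here in the common write-ahead order: journal first, then data). append_db_journal() is a hypothetical helper standing in for the database's own write-ahead log code; each fsync() below additionally drives a commit of the local filesystem's own journal:

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper standing in for the database's own write-ahead log:
// append a record to the db journal file and flush it.
static int append_db_journal(int journal_fd, const char *rec, size_t len) {
  if (write(journal_fd, rec, len) != (ssize_t)len)
    return -1;
  return fsync(journal_fd);   // fsync #1: also forces a fs-journal commit
}

int main() {
  int journal_fd = open("db.journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
  int data_fd    = open("table.dat",  O_WRONLY | O_CREAT, 0644);

  const char rec[]  = "txn 42: update row 7";
  const char data[] = "new row contents";

  // 1. Write-ahead: record the intent in the db's own journal and flush it.
  append_db_journal(journal_fd, rec, sizeof(rec) - 1);

  // 2. Apply the change to the data file at the row's offset.
  pwrite(data_fd, data, sizeof(data) - 1, 7 * 4096);

  // 3. fsync #2: another trip through the fs journal before the transaction
  //    is considered durable and the db journal entry can be trimmed.
  fsync(data_fd);

  close(data_fd);
  close(journal_fd);
  return 0;
}

Jan's ordering (transaction log write + flush, then data write) is the same pattern; either way there are two flushes per transaction, and each of them also passes through the local fs journal, which is exactly the doubled overhead being discussed.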
>> And both of us
>> have independently concluded that the local fs is the wrong tool for the
>> job.
>>
>>>> Offloading stuff to the file system doesn't save you CPU--it just makes
>>>> someone else responsible. What does save you CPU is avoiding the
>>>> complexity you don't need (i.e., half of what the kernel file system is
>>>> doing, and everything we have to do to work around an ill-suited
>>>> interface) and instead implementing exactly the set of features that we need
>>>> to get the job done.
>>>
>>> In theory you are right.
>>> In practice, in-kernel filesystems are fast, and FUSE filesystems are slow.
>>> Ceph is like that - slow. And you want to get fast by writing more code :)
>>
>> You get fast by writing the *right* code, and eliminating layers of the
>> stack (the local file system, in this case) that are providing
>> functionality you don't want (or more functionality than you need at too
>> high a price).
>>
>>> I dug into BlueStore and how you want to implement it, and from what I
>>> understood you are reimplementing what the filesystem journal does...
>>
>> Yes. The difference is that a single journal manages all of the metadata
>> and data consistency in the system, instead of a local fs journal managing
>> just block allocation and a second Ceph journal managing Ceph's data
>> structures.
>>
>> The main benefit, though, is that we can choose a different set of
>> semantics, like the ability to overwrite data in a file/object and update
>> metadata atomically. You can't do that with POSIX without building a
>> write-ahead journal and double-writing.
>>
>>> Btw, I think at least the i_version xattr could be atomic.
>>
>> Nope. All major file systems (other than btrfs) overwrite data in place,
>> which means it is impossible for any piece of metadata to accurately
>> indicate whether you have the old data or the new data (or perhaps a bit
>> of both).
>>
>>> It makes sense that it will be 2x faster if you avoid the double journalling,
>>> but I'd be very much surprised if it helped with CPU usage one bit - I
>>> certainly don't see my filesystems consuming a significant amount of CPU
>>> time on any of my machines, and I seriously doubt you're going to do
>>> that better, sorry.
>>
>> Apples and oranges. The file systems aren't doing what we're doing. But
>> once you combine what we spend now in FileStore + a local fs,
>> BlueStore will absolutely spend less CPU time.
>
> I don't think it's apples and oranges.
> If I export two files via losetup over iSCSI and make a RAID1 (swraid) out of
> them in the guest VM, I bet it will still be faster than Ceph with BlueStore.
> And yet it will provide the same guarantees and do the same job without
> eating significant CPU time.
> True or false?
> Yes, the filesystem is unnecessary in this scenario, but the performance
> impact is negligible if you use it right.
>
>>
>>> What makes you think you will do a better job than all the people who
>>> made xfs/ext4/...?
>>
>> I don't. XFS et al. are great file systems and for the most part I have no
>> complaints about them. The problem is that Ceph doesn't need a file
>> system: it needs a transactional object store with a different set of
>> features. So that's what we're building.
>>
>> sage
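To illustrate what "a transactional object store with a different set of features" buys over plain POSIX - overwriting object data and updating object metadata atomically - here is a minimal, purely hypothetical sketch. The Transaction, Op and TinyObjectStore names are made up for illustration and are not the actual Ceph ObjectStore/BlueStore API:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One low-level operation inside a transaction.
struct Op {
  enum class Type { Write, SetAttr };
  Type type;
  std::string object;          // object name
  uint64_t offset;             // used by Write
  std::string key;             // used by SetAttr
  std::vector<uint8_t> data;   // payload for either op
};

// A batch of operations that must become durable together or not at all.
struct Transaction {
  std::vector<Op> ops;

  void write(std::string obj, uint64_t off, std::vector<uint8_t> buf) {
    ops.push_back({Op::Type::Write, std::move(obj), off, "", std::move(buf)});
  }
  void set_attr(std::string obj, std::string key, std::vector<uint8_t> val) {
    ops.push_back({Op::Type::SetAttr, std::move(obj), 0, std::move(key),
                   std::move(val)});
  }
};

// Toy in-memory store: a real store would first append the encoded
// transaction to its own journal/WAL and flush once, then apply the ops;
// replaying the WAL after a crash yields either all of the ops or none.
class TinyObjectStore {
 public:
  void submit(const Transaction &t) {
    // (journal append + single flush would go here)
    for (const auto &op : t.ops) {
      if (op.type == Op::Type::Write)
        apply_write(op);
      else
        attrs_[op.object][op.key] = op.data;
    }
  }

 private:
  void apply_write(const Op &op) {
    auto &obj = objects_[op.object];
    if (obj.size() < op.offset + op.data.size())
      obj.resize(op.offset + op.data.size());
    std::copy(op.data.begin(), op.data.end(), obj.data() + op.offset);
  }

  std::map<std::string, std::vector<uint8_t>> objects_;
  std::map<std::string, std::map<std::string, std::vector<uint8_t>>> attrs_;
};

int main() {
  TinyObjectStore store;
  Transaction t;
  // Overwrite part of an object *and* bump a version attribute in one txn.
  t.write("rbd_data.1234", 4096, {'n', 'e', 'w'});
  t.set_attr("rbd_data.1234", "version", {'7'});
  store.submit(t);   // both changes land together, or (after a crash) neither
  return 0;
}

The point is the single journal append and flush inside submit(): the data overwrite and the metadata update ride in one journal entry, instead of a write() plus a separate setxattr() that a crash (or an in-place overwrite) can split apart.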