Hi Jan, I can answer your question very quickly:
We need that! We need and want a stable, self-healing, scalable, robust, reliable storage system which can talk to our infrastructure in different languages. I fully understand that people whose infrastructure is about to lose support from a piece of software are not amused. I don't understand your insistence on looking at this matter from only your own point of view.

If you just think about it for a moment, you will remind yourself that this software is not designed for a single purpose. It is designed for multiple purposes, where "purpose" means the different flavours/ways in which different people try to use the software. I am very thankful when software designers try to make their product better and better. If that means they have to drop support for a filesystem type, then so be it. You will not die from that, and neither will anyone else.

I am waiting for the upcoming Jewel release to build a new cluster and migrate the old Hammer cluster into it. Jewel will have a new feature that allows migrating clusters. So what is your problem? For now I don't see any drawback for you. If the software can serve your RBD VMs, then you should not care whether it sits on ext2, 3, 4, 200, XFS or $what_ever_new. As long as it works, and maybe even provides more features than before, what is the problem? That YOU don't need those features? That you don't want your running system to be changed? That you are not the only Ceph user and the software is not privately developed for your needs? Seriously?

So let me welcome you to this world, where you are not alone, and where there are other people who also have wishes and wants. I am sure that the people who so badly need/want ext4 support are in the minority. Otherwise the Ceph developers would not drop it, because they are not stupid enough to drop a feature that is wanted/needed by a majority of people.

So please try to open your eyes a bit to the rest of the Ceph users. And if you manage that, try to open your eyes to the Ceph developers who built a product here that enables you to manage your stuff and whatever else you use Ceph for. And if all of that is not OK/right from your side, then become a Ceph developer and code contributor. Keep ext4 support alive and try to convince the other developers to maintain a feature which is technically not needed, technically in the way of better software design, and used by a minority of users. Good luck with that!

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de

Address:
IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, district court of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 12.04.2016 at 22:33, Jan Schermer wrote:
> Still the answer to most of your points from me is "but who needs that?"
> Who needs to have exactly the same data in two separate objects (replicas)?
> Ceph needs it because of "consistency", but the app (the VM filesystem) is fine
> with whatever version, because the flush didn't happen (if it had, the contents
> would be the same).
>
> You say "Ceph needs", but I say "the guest VM needs" - there's the problem.
>
>> On 12 Apr 2016, at 21:58, Sage Weil <s...@newdream.net> wrote:
>>
>> Okay, I'll bite.
>>
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>>> Local kernel file systems maintain their own internal consistency, but
>>>> they only provide the consistency promises that the POSIX interface
>>>> does--which is almost nothing.
>>>
>>> ... which is exactly what everyone expects
>>> ... which is everything any app needs
>>>
>>>> That's why every complicated data
>>>> structure (e.g., database) stored on a file system ever includes its own
>>>> journal.
>>> ... see?
>>
>> They do this because POSIX doesn't give them what they want. They
>> implement a *second* journal on top. The result is that you get the
>> overhead from both--the fs journal keeping its data structures consistent,
>> and the database keeping its own consistent. If you're not careful, that means
>> the db has to do something like file write, fsync, db journal append,
>> fsync.
> It's more like:
> transaction log write, flush
> data write
> That's simply because most filesystems don't journal data, but some do.
>
>
>> And both fsyncs turn into a *fs* journal io and flush. (Smart
>> databases often avoid most of the fs overhead by putting everything in a
>> single large file, but at that point the file system isn't actually doing
>> anything except passing IO to the block layer).
>>
>> There is nothing wrong with POSIX file systems. They have the unenviable
>> task of catering to a huge variety of workloads and applications, but are
>> truly optimal for very few. And that's fine. If you want a local file
>> system, you should use ext4 or XFS, not Ceph.
>>
>> But it turns out ceph-osd isn't a generic application--it has a pretty
>> specific workload pattern, and POSIX doesn't give us the interfaces we
>> want (mainly, atomic transactions or ordered object/file enumeration).
>
> The workload (with RBD) is inevitably expecting POSIX. Who needs more than
> that? To me that indicates unnecessary guarantees.
>
>>
>>>> We could "wing it" and hope for
>>>> the best, then do an expensive crawl and rsync of data on recovery, but we
>>>> chose very early on not to do that. If you want a system that "just"
>>>> layers over an existing filesystem, you can try Gluster (although note
>>>> that they have a different sort of pain with the ordering of xattr
>>>> updates, and are moving toward a model that looks more like Ceph's backend
>>>> in their next version).
>>>
>>> True, which is why we dismissed it.
>>
>> ...and yet it does exactly what you asked for:
>
> I was implying it suffers the same flaws. In any case it wasn't really fast
> and it seemed overly complex.
> To be fair, it was some while ago when I tried it.
> Can't talk about consistency - I don't think I ever used it in production as
> more than a PoC.
>
>>
>>>>> IMO, if Ceph was moving in the right direction [...] Ceph would
>>>>> simply distribute our IO around with CRUSH.
>>
>> You want Ceph to "just use a file system." That's what Gluster does--it
>> just layers the distributed namespace right on top of a local namespace.
>> If you didn't care about correctness or data safety, it would be
>> beautiful, and just as fast as the local file system (modulo network).
>> But if you want your data safe, you immediately realize that local POSIX
>> file systems don't give you what you need: the atomic update of two files
>> on different servers so that you can keep your replicas in sync. Gluster
>> originally took the minimal path to accomplish this: a "simple"
>> prepare/write/commit, using xattrs as transaction markers. We took a
>> heavyweight approach to support arbitrary transactions.
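To make the double-journal point above concrete ("file write, fsync, db journal append, fsync"), here is a minimal sketch, in C++ with plain POSIX calls, of the kind of write path a database ends up issuing on a journaling filesystem (shown here in the common write-ahead order: journal first, then data). append_db_journal() is a hypothetical helper standing in for the database's own write-ahead log code; each fsync() below additionally drives a commit of the local filesystem's own journal:

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper standing in for the database's own write-ahead log:
// append a record to the db journal file and flush it.
static int append_db_journal(int journal_fd, const char *rec, size_t len) {
  if (write(journal_fd, rec, len) != (ssize_t)len)
    return -1;
  return fsync(journal_fd);   // fsync #1: also forces a fs-journal commit
}

int main() {
  int journal_fd = open("db.journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
  int data_fd    = open("table.dat",  O_WRONLY | O_CREAT, 0644);

  const char rec[]  = "txn 42: update row 7";
  const char data[] = "new row contents";

  // 1. Write-ahead: record the intent in the db's own journal and flush it.
  append_db_journal(journal_fd, rec, sizeof(rec) - 1);

  // 2. Apply the change to the data file at the row's offset.
  pwrite(data_fd, data, sizeof(data) - 1, 7 * 4096);

  // 3. fsync #2: another trip through the fs journal before the transaction
  //    is considered durable and the db journal entry can be trimmed.
  fsync(data_fd);

  close(data_fd);
  close(journal_fd);
  return 0;
}

Jan's ordering (transaction log write + flush, then data write) is the same pattern; either way there are two flushes per transaction, and each of them also passes through the local fs journal, which is exactly the doubled overhead being discussed.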
>> And both of us
>> have independently concluded that the local fs is the wrong tool for the
>> job.
>>
>>>> Offloading stuff to the file system doesn't save you CPU--it just makes
>>>> someone else responsible. What does save you CPU is avoiding the
>>>> complexity you don't need (i.e., half of what the kernel file system is
>>>> doing, and everything we have to do to work around an ill-suited
>>>> interface) and instead implementing exactly the set of features that we need
>>>> to get the job done.
>>>
>>> In theory you are right.
>>> In practice, in-kernel filesystems are fast, and FUSE filesystems are slow.
>>> Ceph is like that - slow. And you want to get fast by writing more code :)
>>
>> You get fast by writing the *right* code, and eliminating layers of the
>> stack (the local file system, in this case) that are providing
>> functionality you don't want (or more functionality than you need at too
>> high a price).
>>
>>> I dug into BlueStore and how you want to implement it, and from what I
>>> understood you are reimplementing what the filesystem journal does...
>>
>> Yes. The difference is that a single journal manages all of the metadata
>> and data consistency in the system, instead of a local fs journal managing
>> just block allocation and a second Ceph journal managing Ceph's data
>> structures.
>>
>> The main benefit, though, is that we can choose a different set of
>> semantics, like the ability to overwrite data in a file/object and update
>> metadata atomically. You can't do that with POSIX without building a
>> write-ahead journal and double-writing.
>>
>>> Btw, I think at least the i_version xattr could be atomic.
>>
>> Nope. All major file systems (other than btrfs) overwrite data in place,
>> which means it is impossible for any piece of metadata to accurately
>> indicate whether you have the old data or the new data (or perhaps a bit
>> of both).
>>
>>> It makes sense that it will be 2x faster if you avoid the double journalling,
>>> but I'd be very much surprised if it helped with CPU usage one bit - I
>>> certainly don't see my filesystems consuming a significant amount of CPU
>>> time on any of my machines, and I seriously doubt you're going to do
>>> that better, sorry.
>>
>> Apples and oranges. The file systems aren't doing what we're doing. But
>> once you combine what we spend now in FileStore + a local fs,
>> BlueStore will absolutely spend less CPU time.
>
> I don't think it's apples and oranges.
> If I export two files via losetup over iSCSI and make a RAID1 (swraid) out of
> them in the guest VM, I bet it will still be faster than Ceph with BlueStore.
> And yet it will provide the same guarantees and do the same job without
> eating significant CPU time.
> True or false?
> Yes, the filesystem is unnecessary in this scenario, but the performance
> impact is negligible if you use it right.
>
>>
>>> What makes you think you will do a better job than all the people who
>>> made xfs/ext4/...?
>>
>> I don't. XFS et al. are great file systems and for the most part I have no
>> complaints about them. The problem is that Ceph doesn't need a file
>> system: it needs a transactional object store with a different set of
>> features. So that's what we're building.
>>
>> sage
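To illustrate what "a transactional object store with a different set of features" buys over plain POSIX - overwriting object data and updating object metadata atomically - here is a minimal, purely hypothetical sketch. The Transaction, Op and TinyObjectStore names are made up for illustration and are not the actual Ceph ObjectStore/BlueStore API:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One low-level operation inside a transaction.
struct Op {
  enum class Type { Write, SetAttr };
  Type type;
  std::string object;          // object name
  uint64_t offset;             // used by Write
  std::string key;             // used by SetAttr
  std::vector<uint8_t> data;   // payload for either op
};

// A batch of operations that must become durable together or not at all.
struct Transaction {
  std::vector<Op> ops;

  void write(std::string obj, uint64_t off, std::vector<uint8_t> buf) {
    ops.push_back({Op::Type::Write, std::move(obj), off, "", std::move(buf)});
  }
  void set_attr(std::string obj, std::string key, std::vector<uint8_t> val) {
    ops.push_back({Op::Type::SetAttr, std::move(obj), 0, std::move(key),
                   std::move(val)});
  }
};

// Toy in-memory store: a real store would first append the encoded
// transaction to its own journal/WAL and flush once, then apply the ops;
// replaying the WAL after a crash yields either all of the ops or none.
class TinyObjectStore {
 public:
  void submit(const Transaction &t) {
    // (journal append + single flush would go here)
    for (const auto &op : t.ops) {
      if (op.type == Op::Type::Write)
        apply_write(op);
      else
        attrs_[op.object][op.key] = op.data;
    }
  }

 private:
  void apply_write(const Op &op) {
    auto &obj = objects_[op.object];
    if (obj.size() < op.offset + op.data.size())
      obj.resize(op.offset + op.data.size());
    std::copy(op.data.begin(), op.data.end(), obj.data() + op.offset);
  }

  std::map<std::string, std::vector<uint8_t>> objects_;
  std::map<std::string, std::map<std::string, std::vector<uint8_t>>> attrs_;
};

int main() {
  TinyObjectStore store;
  Transaction t;
  // Overwrite part of an object *and* bump a version attribute in one txn.
  t.write("rbd_data.1234", 4096, {'n', 'e', 'w'});
  t.set_attr("rbd_data.1234", "version", {'7'});
  store.submit(t);   // both changes land together, or (after a crash) neither
  return 0;
}

The point is the single journal append and flush inside submit(): the data overwrite and the metadata update ride in one journal entry, instead of a write() plus a separate setxattr() that a crash (or an in-place overwrite) can split apart.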