I think if there was a new disk format, we could get away without the journal. It seems that Ceph is trying to do extra things because regular file systems don't do exactly what is needed. I can understand why the developers aren't excited about building and maintaining a new disk format, but I think it could be pretty light and highly optimized for object storage. I even started thinking through what one might look like, but I've never written a file system, so I'm probably just living in a fantasy land. I still might try...

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Mon, Oct 19, 2015 at 12:18 PM, Jan Schermer wrote:
> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what other people using Ceph think.
>
> If I were to use RADOS directly in my app, I'd probably rejoice at its capabilities and how useful and non-legacy it is, but my use is basically RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities are unneeded.
> I live in this RBD bubble so that's all I know, but isn't this also the only usage pattern that 90% (or more) of people using Ceph care about? Isn't this what drives Ceph adoption in *Stack? Yet isn't the biggest PITA, when it comes to displacing traditional (DAS, SAN, NAS) solutions, the overhead (= complexity) of Ceph?*
>
> What are the apps that actually use the RADOS features? I know Swift has a RADOS backend (which does the same thing Swift already did by itself, maybe with stronger consistency?), and RGW (which basically does the same as Swift?) - it doesn't seem either of those would need anything special. What else is there? Apps that needed more than POSIX semantics (like databases, for transactions) have already developed mechanisms to do that - how likely is my database server to replace those mechanisms with the RADOS API and objects in the future? It's all posix-filesystem-centric and that's not going away.
>
> Ceph feels like a perfect example of this:
> https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of the journal and operate on the filestore directly - that should suffice for anyone using RBD only (in fact, until very recently I thought it was possible to just disable the journal in the config...)
>
> Jan
>
> * Look at what other solutions do to get better performance - RDMA, for example. You can't really get true RDMA performance if you're not touching the drive's DMA buffer (or something else very close to the data) over the network directly, with minimal latency. That doesn't (IMHO) preclude software-defined storage like Ceph from working over RDMA, but you probably shouldn't try to outsmart the IO patterns...
>
>> On 19 Oct 2015, at 19:44, James (Fei) Liu-SSI wrote:
>>
>> Hi John,
>> Thanks for your explanations.
>>
>> Actually, clients can. Clients can request fairly complex operations like "read an xattr, stop if it's not there, now write the following discontinuous regions of the file...". RADOS executes these transactions atomically.
>>
>> [James] Would you mind detailing the operations in RADOS transactions a little bit more? Is there any limit on the number of ops in one RADOS transaction? What if we came up with similar transaction capabilities, either in a new file system or in a key-value store, to map what a RADOS transaction provides? If we could get to a solution like what Jan proposed - a 1:1 mapping of transactions onto the filesystem/keyvaluestore - we wouldn't necessarily need journaling in the objectstore, which would dramatically improve the performance of Ceph.
>>
>> Thanks.
>>
>> Regards,
>> James
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
>> Sent: Monday, October 19, 2015 3:44 AM
>> To: Jan Schermer
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>
>> On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer wrote:
>>> I understand this. But the clients can't request something that
>>> doesn't fit a (POSIX) filesystem's capabilities.
>>
>> Actually, clients can. Clients can request fairly complex operations like "read an xattr, stop if it's not there, now write the following discontinuous regions of the file...". RADOS executes these transactions atomically.
>>
>> However, you are correct that for many cases (new files, sequential writes) it is possible to avoid the double write of data: the in-development newstore backend does that. But we still have cases where we do fancier things than the backend (be it POSIX or a KV store) can handle, so it will have non-fast-path, higher-overhead ways of handling them.
>>
>> John
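To make the compound operation John describes a bit more concrete, here is a rough sketch using the librados C write-op API. It is only an illustration of the idea - the pool name ("rbd"), object name, xattr key/values and offsets are invented, and error handling is kept minimal:

#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;

    if (rados_create(&cluster, NULL) < 0)
        return 1;
    rados_conf_read_file(cluster, NULL);              /* default ceph.conf search path */
    if (rados_connect(cluster) < 0)
        return 1;
    if (rados_ioctx_create(cluster, "rbd", &io) < 0)  /* pool name is assumed */
        return 1;

    rados_write_op_t op = rados_create_write_op();

    /* Guard: abort the whole transaction unless xattr "state" equals "ready". */
    rados_write_op_cmpxattr(op, "state", LIBRADOS_CMPXATTR_OP_EQ, "ready", 5);

    /* Two discontinuous writes, part of the same transaction. */
    rados_write_op_write(op, "AAAA", 4, 0);           /* bytes at offset 0     */
    rados_write_op_write(op, "BBBB", 4, 1048576);     /* bytes at offset 1 MiB */

    /* A metadata update carried in the same atomic unit. */
    rados_write_op_setxattr(op, "state", "written", 7);

    /* The OSD applies everything above atomically, or nothing at all
     * (a nonzero result means the guard failed or an error occurred). */
    int r = rados_write_op_operate(op, io, "demo-object", NULL, 0);
    printf("transaction result: %d\n", r);

    rados_release_write_op(op);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return r == 0 ? 0 : 1;
}

The point is that all of those steps travel to the OSD and are applied as one transaction - something a plain POSIX filesystem cannot promise for a mix of data, xattr and omap updates, which is what the rest of the thread turns on.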
>>> That means the requests can map 1:1 into the filestore (O_FSYNC from the client == O_FSYNC on the filestore object...).
>>> The pagecache and IO schedulers are already smart enough to merge requests and preserve ordering - they just do the right thing already. It's true that in a distributed environment one async request can map to one OSD and then a synchronous one comes along and needs the first one to be flushed beforehand, so that logic is presumably in place already - but I still don't see much need for a journal in there (btw, in the case of RBD with caching this logic is probably not even needed at all, and merging requests in the RBD cache makes more sense than merging somewhere down the line).
>>> It might be faster to merge small writes in the journal when the journal is on SSDs and the filestore is on spinning rust, but it will surely be slower (CPU-bound by ceph-osd?) when the filestore is fast enough or when the merging is not optimal.
>>> I have never touched anything but a pure SSD cluster, though - I have always been CPU bound, and that's why I started thinking about this in the first place. I'd love to have my disks saturated with requests from clients one day.
>>>
>>> Don't take this the wrong way, but I've been watching Ceph perf talks and such and haven't seen anything that would make Ceph comparably fast to an ordinary SAN/NAS.
>>> Maybe this is a completely wrong idea, I just think it might be worth thinking about.
>>>
>>> Thanks
>>>
>>> Jan
>>>
>>>
>>>> On 14 Oct 2015, at 20:29, Somnath Roy wrote:
>>>>
>>>> A filesystem like XFS guarantees a single file write, but in a Ceph transaction we are touching the file, xattrs and leveldb (omap), so there is no way the filesystem can guarantee that transaction. That's why FileStore has implemented a write-ahead journal. Basically, it writes the entire transaction object there and only trims it from the journal once it has actually been applied (all the operations executed) and persisted in the backend.
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Jan Schermer [mailto:j...@schermer.cz]
>>>> Sent: Wednesday, October 14, 2015 9:06 AM
>>>> To: Somnath Roy
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>>>
>>>> But that's exactly what filesystems and their own journals do already :-)
>>>>
>>>> Jan
>>>>
>>>>> On 14 Oct 2015, at 17:02, Somnath Roy wrote:
>>>>>
>>>>> Jan,
>>>>> The journal helps FileStore maintain transactional integrity in the event of a crash. That's the main reason.
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
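To illustrate the point Somnath makes about the separate update paths (file data, xattrs, leveldb omap), here is a sketch - not FileStore's actual code - of what applying one such transaction without a journal would look like; the object path and attribute name are invented:

#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>

/* Object path and attribute name below are invented for the example. */
int apply_transaction_without_journal(const char *obj_path)
{
    int fd = open(obj_path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* 1) object data */
    (void)pwrite(fd, "new data", 8, 4096);
    (void)fdatasync(fd);
    /* <-- a crash here leaves new data but old metadata */

    /* 2) object metadata kept in an xattr */
    (void)fsetxattr(fd, "user.example_attr", "new meta", 8, 0);
    (void)fsync(fd);
    /* <-- a crash here leaves new data and xattr but an old omap entry */

    /* 3) omap entry kept in leveldb - a separate store with its own
     *    durability, outside the filesystem's control (call elided) */

    close(fd);
    return 0;
}

Each step is durable on its own, but a crash between any two of them leaves the object half-updated. That is the gap the write-ahead journal closes: the whole transaction is written and synced to the journal first, then applied to the filesystem, and after a crash it can simply be replayed.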
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
>>>>> Sent: Wednesday, October 14, 2015 2:28 AM
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>>>>
>>>>> Hi,
>>>>> I've been thinking about this for a while now - does Ceph really need a journal? Filesystems are already pretty good at committing data to disk when asked (and much faster, too); we have external journals in XFS and ext4...
>>>>> In a scenario where the client does an ordinary write, there's no need to flush it anywhere (the app didn't ask for it), so it ends up in pagecache and gets committed eventually.
>>>>> If a client asks for the data to be flushed, then fdatasync/fsync on the filestore object takes care of that, including ordering and such.
>>>>> For reads, you just read from the filestore (no need to differentiate between filestore and journal) - pagecache gives you the right version already.
>>>>>
>>>>> Or is the journal there to achieve some tiering for writes when running spindles together with SSDs? This is IMO the only thing ordinary filesystems don't do out of the box, even when the filesystem journal is put on an SSD - the data gets flushed to the spindle whenever it is fsync-ed (even with data=journal). But in reality most of the data will hit the spindle either way, and when you run with SSDs it will always be much slower. And even for tiering there are already many options (bcache, flashcache or even ZFS L2ARC) that are much more performant and proven stable. I think the fact that people feel the need to combine Ceph with stuff like that already proves the point.
>>>>>
>>>>> So a very interesting scenario would be to disable the Ceph journal and at most use data=journal on ext4. The complexity of the data path would drop significantly, latencies would decrease, CPU time would be saved...
>>>>> I just feel that Ceph has lots of unnecessary complexity inside that duplicates what filesystems (and pagecache...) have been doing for a while now without eating most of our CPU cores - why don't we use that? Is it possible to disable the journal completely?
>>>>>
>>>>> Did I miss something that makes the journal essential?
>>>>>
>>>>> Jan
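For what it's worth, the journal-less path Jan sketches above would look roughly like this at the filestore level - a minimal sketch, not Ceph code, with an invented object path: an ordinary client write is just a buffered write into the object's file, and only an explicit client flush becomes an fdatasync() on it.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Ordinary client write: buffered into the pagecache, committed by the
 * kernel eventually. */
static ssize_t object_write(int obj_fd, const void *buf, size_t len, off_t off)
{
    return pwrite(obj_fd, buf, len, off);
}

/* Client-requested flush: make everything written so far durable. */
static int object_flush(int obj_fd)
{
    return fdatasync(obj_fd);
}

int main(void)
{
    int fd = open("/var/lib/osd/objects/demo", O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return 1;
    (void)object_write(fd, "hello", 5, 0);  /* async from the client's point of view */
    (void)object_flush(fd);                 /* only now does it have to hit the disk */
    close(fd);
    return 0;
}

The counterpoint earlier in the thread is that this covers plain data writes, but not transactions that touch data, xattrs and omap together - which is what the write-ahead journal (and the in-development newstore backend) is there for.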