I think if there were a new disk format, we could get away without the
journal. It seems that Ceph is trying to do extra things because
regular file systems don't do exactly what is needed. I can understand
why the developers aren't excited about building and maintaining a new
disk format, but I think it could be pretty light and highly optimized
for object storage. I even started thinking through what one might
look like, but I've never written a file system so I'm probably just
living in a fantasy land. I still might try...
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Oct 19, 2015 at 12:18 PM, Jan Schermer  wrote:
> I'm sorry for appearing a bit dull (on purpose); I was hoping I'd hear what 
> other people using Ceph think.
>
> If I were to use RADOS directly in my app I'd probably rejoice at its 
> capabilities and how useful and non-legacy it is, but my use is basically for 
> RBD volumes with OpenStack (libvirt, qemu...). And for that those 
> capabilities are unneeded.
> I live in this RBD bubble, so that's all I know, but isn't this also the only 
> usage pattern that 90% (or more) of people using Ceph care about? Isn't this 
> what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it 
> comes to displacing traditional (DAS, SAN, NAS) solutions the overhead 
> (=complexity) of Ceph?*
>
> What are the apps that actually use the RADOS features? I know Swift has some 
> RADOS backend (which does the same thing Swift already did by itself, maybe 
> with stronger consistency?) and RGW (which basically does the same as Swift?) - 
> it doesn't seem that either of those would need anything special. What else is there?
> Apps that needed more than POSIX semantics (like databases, for transactions) 
> already developed mechanisms to do that - how likely is my database server to 
> replace those mechanisms with the RADOS API and objects in the future? It's all 
> POSIX-filesystem-centric, and that's not going away.
>
> Ceph feels like a perfect example of this 
> https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of the journal and 
> operate on the filestore directly - that should suffice for anyone using RBD only 
> (in fact, until very recently I thought it was possible to just disable the 
> journal in the config...)
>
> Jan
>
> * look at what other solutions do to get better performance - RDMA for 
> example. You can't really get true RDMA performance if you're not touching 
> the drive DMA buffer (or something else very close to data) over network 
> directly with minimal latency. That doesn't (IMHO) preclude 
> software-defined-storage like Ceph from working over RDMA, but you probably 
> shouldn't try to outsmart the IO patterns...
>
>> On 19 Oct 2015, at 19:44, James (Fei) Liu-SSI  wrote:
>>
>> Hi John,
>>    Thanks for your explanations.
>>
>>    Actually, clients can.  Clients can request fairly complex operations 
>> like "read an xattr, stop if it's not there, now write the following 
>> discontinuous regions of the file...".  RADOS executes these transactions 
>> atomically.
>>    [James]  Would you mind detailing a little bit more the operations in 
>> RADOS transactions?  Is there any limit on the number of ops in one RADOS 
>> transaction? What if we came up with similar transaction capabilities in either a 
>> new file system or a key-value store to match what RADOS transactions provide?  If we 
>> can come up with a solution like what Jan proposed, a 1:1 mapping for transactions 
>> between the filesystem/key-value store, we wouldn't need journaling in the 
>> objectstore, which would dramatically improve the performance of Ceph.
>>
>> Thanks.
>>
>> Regards,
>> James
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> John Spray
>> Sent: Monday, October 19, 2015 3:44 AM
>> To: Jan Schermer
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>
>> On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer  wrote:
>>> I understand this. But the clients can't request something that
>>> doesn't fit a (POSIX) filesystem's capabilities
>>
>> Actually, clients can.  Clients can request fairly complex operations like 
>> "read an xattr, stop if it's not there, now write the following 
>> discontinuous regions of the file...".  RADOS executes these transactions 
>> atomically.
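>>
>> For illustration, such a compound operation might look roughly like this with 
>> the librados C++ API (a sketch only: the pool, object and xattr names and the 
>> offsets are made-up examples, and error handling is omitted):
>>
>>   #include <rados/librados.hpp>
>>   #include <string>
>>
>>   int main() {
>>     // Connect to the cluster (assumes a standard ceph.conf and an "admin" client).
>>     librados::Rados cluster;
>>     cluster.init("admin");
>>     cluster.conf_read_file(nullptr);
>>     cluster.connect();
>>
>>     librados::IoCtx ioctx;
>>     cluster.ioctx_create("rbd", ioctx);   // pool name is just an example
>>
>>     // Build one atomic transaction: proceed only if xattr "version" equals "1",
>>     // then write two discontiguous regions of the object.
>>     librados::ObjectWriteOperation op;
>>     librados::bufferlist expected, chunk1, chunk2;
>>     expected.append(std::string("1"));
>>     chunk1.append(std::string("first region"));
>>     chunk2.append(std::string("second region"));
>>     op.cmpxattr("version", LIBRADOS_CMPXATTR_OP_EQ, expected);  // stop if it doesn't match
>>     op.write(0, chunk1);
>>     op.write(4 * 1024 * 1024, chunk2);
>>
>>     // The OSD applies all of these ops atomically, or none of them.
>>     int r = ioctx.operate("example-object", &op);
>>
>>     cluster.shutdown();
>>     return r == 0 ? 0 : 1;
>>   }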
>>
>> However, you are correct that for many cases (new files, sequential
>> writes) it is possible to avoid the double write of data: the in-development 
>> newstore backend does that.  But we still have cases where we do fancier 
>> things than the backend (be it POSIX or a KV 
>> store) can handle, so we will have non-fast-path, higher-overhead ways of 
>> handling it.
>>
>> John
>>
>> That means the requests can map 1:1 into the filestore (O_FSYNC from client 
>> == O_FSYNC on the filestore object... ).
>> Pagecache/io-schedulers are already smart enough to merge requests and preserve 
>> ordering - they just do the right thing already. It's true that in a 
>> distributed environment one async request can map to one OSD and then a 
>> synchronous one comes along that needs the first one flushed beforehand, so 
>> that logic is presumably in place already - but I still don't see much need 
>> for a journal in there (btw in the case of RBD with caching, this logic is 
>> probably not even needed at all, and merging requests in the RBD cache makes more 
>> sense than merging somewhere further down the line).
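>>
>> To make the idea concrete, here is a minimal sketch of that 1:1 mapping, 
>> assuming the filestore object is just a plain file and the OSD simply mirrors 
>> the client's flush flag onto it (purely illustrative, not how FileStore 
>> actually works; the handler name and parameters are hypothetical):
>>
>>   #include <errno.h>
>>   #include <fcntl.h>
>>   #include <unistd.h>
>>
>>   // One client write maps straight onto the backing file.
>>   // 'flush_requested' is true only when the client asked for the write to be synced.
>>   int handle_write(const char *object_path, const void *buf, size_t len,
>>                    off_t offset, bool flush_requested)
>>   {
>>       int fd = open(object_path, O_WRONLY | O_CREAT, 0644);
>>       if (fd < 0)
>>           return -errno;
>>
>>       if (pwrite(fd, buf, len, offset) != (ssize_t)len) {   // lands in pagecache
>>           close(fd);
>>           return -EIO;
>>       }
>>
>>       int r = 0;
>>       if (flush_requested)
>>           r = fdatasync(fd);   // pay for durability only when the client asked for it
>>
>>       close(fd);
>>       return r < 0 ? -errno : 0;
>>   }
>>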
>>> It might be faster to merge small writes in the journal when the journal is on 
>>> SSDs and the filestore is on spinning rust, but it will surely be slower (CPU 
>>> bound by ceph-osd?) when the filestore is fast enough or when the merging 
>>> is not optimal.
>>> I have never touched anything but a pure SSD cluster, though - I have 
>>> always been CPU bound and that's why I started thinking about this in the 
>>> first place. I'd love to have my disks saturated with requests from clients 
>>> one day.
>>>
>>> Don't take this the wrong way, but I've been watching Ceph perf talks and 
>>> such and haven't seen anything that would make Ceph comparably fast to an 
>>> ordinary SAN/NAS.
>>> Maybe this is a completely wrong idea; I just think it might be worth 
>>> thinking about.
>>>
>>> Thanks
>>>
>>> Jan
>>>
>>>
>>>> On 14 Oct 2015, at 20:29, Somnath Roy  wrote:
>>>>
>>>> A filesystem like XFS guarantees a single file write, but in a Ceph transaction 
>>>> we are touching the file, xattrs, and leveldb (omap), so there is no way the 
>>>> filesystem can guarantee that transaction. That's why FileStore has implemented a 
>>>> write-ahead journal. Basically, it writes the entire transaction 
>>>> object there and only trims it from the journal when it has actually been applied 
>>>> (all the operations executed) and persisted in the backend.
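>>>>
>>>> Roughly, the pattern looks like this (an illustrative sketch of the 
>>>> write-ahead idea, not FileStore's actual code; the encoding and the apply 
>>>> callback are hypothetical placeholders):
>>>>
>>>>   #include <functional>
>>>>   #include <string>
>>>>   #include <unistd.h>
>>>>
>>>>   // 'encoded_txn' is the whole transaction (data write + xattr + omap ops)
>>>>   // serialized as one blob; 'apply' performs the individual ops on the backend.
>>>>   void commit_transaction(int journal_fd,
>>>>                           const std::string &encoded_txn,
>>>>                           const std::function<void()> &apply)
>>>>   {
>>>>       // 1. Write-ahead: append the entire transaction to the journal and make
>>>>       //    it durable. From here on it can be replayed after a crash.
>>>>       //    (Error handling omitted for brevity.)
>>>>       (void)write(journal_fd, encoded_txn.data(), encoded_txn.size());
>>>>       fdatasync(journal_fd);
>>>>
>>>>       // 2. Apply the ops (file write, setxattr, leveldb/omap update) to the
>>>>       //    backing filesystem.
>>>>       apply();
>>>>
>>>>       // 3. Only after the backend has persisted the ops is the journal entry
>>>>       //    trimmed (FileStore batches this in its sync/commit cycle).
>>>>   }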
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Jan Schermer [mailto:j...@schermer.cz]
>>>> Sent: Wednesday, October 14, 2015 9:06 AM
>>>> To: Somnath Roy
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant 
>>>> sometimes?
>>>>
>>>> But that's exactly what filesystems and their own journals do already
>>>> :-)
>>>>
>>>> Jan
>>>>
>>>>> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
>>>>>
>>>>> Jan,
>>>>> The journal helps FileStore maintain transactional integrity in the 
>>>>> event of a crash. That's the main reason.
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>> Behalf Of Jan Schermer
>>>>> Sent: Wednesday, October 14, 2015 2:28 AM
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>>>>
>>>>> Hi,
>>>>> I've been thinking about this for a while now - does Ceph really need a 
>>>>> journal? Filesystems are already pretty good at committing data to disk 
>>>>> when asked (and much faster, too); we have external journals in XFS and 
>>>>> ext4...
>>>>> In a scenario where a client does an ordinary write, there's no need to 
>>>>> flush it anywhere (the app didn't ask for it), so it ends up in pagecache 
>>>>> and gets committed eventually.
>>>>> If a client asks for the data to be flushed, then fdatasync/fsync on the 
>>>>> filestore object takes care of that, including ordering and so on.
>>>>> For reads, you just read from filestore (no need to differentiate between 
>>>>> filestore/journal) - pagecache gives you the right version already.
>>>>>
>>>>> Or is the journal there to achieve some tiering for writes when running 
>>>>> spindles with SSDs? This is IMO the only thing ordinary filesystems don't 
>>>>> do out of the box even when the filesystem journal is put on an SSD - the data gets 
>>>>> flushed to the spindle whenever it is fsync-ed (even with data=journal). But in 
>>>>> reality, most of the data will hit the spindle either way, and when you 
>>>>> run with SSDs it will always be much slower. And even for tiering - there 
>>>>> are already many options (bcache, flashcache or even ZFS L2ARC) that are 
>>>>> much more performant and proven stable. I think the fact that people 
>>>>> feel the need to combine Ceph with stuff like that already proves the point.
>>>>>
>>>>> So a very interesting scenario would be to disable the Ceph journal and at 
>>>>> most use data=journal on ext4. The complexity of the data path would drop 
>>>>> significantly, latencies would decrease, CPU time would be saved...
>>>>> I just feel that Ceph has lots of unnecessary complexity inside that 
>>>>> duplicates what filesystems (and pagecache...) have been doing for a 
>>>>> while now without eating most of our CPU cores - why don't we use that? 
>>>>> Is it possible to disable the journal completely?
>>>>>
>>>>> Did I miss something that makes journal essential?
>>>>>
>>>>> Jan
>>>>>
>>>>
>>>
>
