Dear ceph-users,

I read a lot of documentation today about Ceph architecture and Linux file
system benchmarks in particular, and I could not help noticing something that I
would like to clear up for myself. Take into account that it has been a while
since I actually touched Linux, but I did some programming on php2b12 and apache
back in the day, so I'm not a complete newbie. The real question is below if you
do not like reading the rest ;)

What I have come to understand about file systems for OSDs is that in theory
btrfs is the file system of choice. However, due to its young age it is not
considered stable yet. Therefore EXT4, or preferably XFS, is used in most cases.
It seems that most people choose these file systems because of their journaling
feature, and XFS in particular for its extended attribute storage, which has a
64 KB limit that should be sufficient for most operations.

But when you look at file system benchmarks, btrfs is really, really slow. Then
comes XFS, then EXT4, but EXT2 really dwarfs all other throughput results. On
journaling systems (like XFS, EXT4 and btrfs) disabling journaling actually
helps throughput as well, sometimes by more than a factor of 2 for writes.

The preferred configuration for OSDs is one OSD per disk. Each object is
striped among all Object Storage Daemons in a cluster. So if I were to take one
disk from the cluster and check its data, chances are slim that I would find a
complete object there (a non-striped, full object I mean).
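
To make my mental model of striping concrete, here is a toy sketch in Python.
It is only an illustration of what I described above, not how Ceph actually
places data: the function name `stripe`, the round-robin placement and the
stripe size are all my own assumptions.

```python
def stripe(data: bytes, num_osds: int, stripe_size: int) -> dict:
    """Split data into fixed-size stripes and hand them out round-robin.

    Returns a mapping of hypothetical OSD id -> list of stripes it holds.
    """
    placement = {osd: [] for osd in range(num_osds)}
    for offset in range(0, len(data), stripe_size):
        osd = (offset // stripe_size) % num_osds
        placement[osd].append(data[offset:offset + stripe_size])
    return placement


obj = b"abcdefghijkl"
placement = stripe(obj, num_osds=3, stripe_size=2)

# A single "disk" only ever sees fragments of the object.
print(placement[0])
```

With three OSDs and 2-byte stripes, OSD 0 ends up with only the first and
fourth stripe, so pulling that one disk would not give me the full object.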

When a client issues an object write (I assume a full object/file write in this
case) it is the client's responsibility to stripe it among the object storage
daemons. When a stripe is successfully stored by the daemon, an ACK signal is
sent to (?) the client and all participating OSDs. When all participating
OSDs for the object have completed, the client assumes all is well and returns
control to the application.
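
The write flow as I understand it could be sketched like this. Again, this is
a toy model under my own assumptions: `ToyOSD` and `client_write` are
hypothetical names, the ACK is just a boolean return value, and nothing here
reflects the real Ceph client or wire protocol.

```python
class ToyOSD:
    """In-memory stand-in for one object storage daemon."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, key, chunk):
        self.store[key] = chunk
        return True  # the "ACK" in this toy model


def client_write(name, data, osds, stripe_size):
    """Stripe data over the OSDs; succeed only if every stripe is ACKed."""
    acks = []
    for n, offset in enumerate(range(0, len(data), stripe_size)):
        osd = osds[n % len(osds)]
        chunk = data[offset:offset + stripe_size]
        acks.append(osd.write((name, n), chunk))
    # Control returns to the application once all stripes are acknowledged.
    return all(acks)


osds = [ToyOSD(i) for i in range(3)]
ok = client_write("obj", b"abcdefgh", osds, stripe_size=2)
print(ok)
```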

If I'm not mistaken, journaling is meant for the rare occasions when a
hardware failure occurs and the data is corrupted. Ceph does this too, in
another way of course. But Ceph should be able to notice whether a block/stripe
is correct or not. In the rare occasion that a node fails while doing a
write, an ACK signal is not sent to the caller and therefore the client can
resend the block/stripe to another OSD. Therefore I fail to see the purpose of
this extra journaling feature.
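
The "no ACK, so resend elsewhere" idea I have in mind looks roughly like the
sketch below. It is purely my assumption of how this could work: `FlakyOSD`
and `write_with_retry` are made-up names, and a dead node is modelled simply
as a write that returns no ACK.

```python
class FlakyOSD:
    """Toy OSD that never ACKs a write when it is marked as not alive."""

    def __init__(self, osd_id, alive=True):
        self.osd_id = osd_id
        self.alive = alive
        self.store = {}

    def write(self, key, chunk):
        if not self.alive:
            return False  # node failed mid-write: no ACK reaches the caller
        self.store[key] = chunk
        return True


def write_with_retry(key, chunk, osds):
    """Try each OSD in turn; return the id of the first one that ACKs."""
    for osd in osds:
        if osd.write(key, chunk):
            return osd.osd_id
    return None  # nobody acknowledged the write


candidates = [FlakyOSD(0, alive=False), FlakyOSD(1)]
print(write_with_retry(("obj", 0), b"ab", candidates))
```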

Also ceph schedules a data scrubbing process every day (or however it is 
configured) that should be able to tackle bad sectors or other errors on the 
file system and accordingly repair them on the same daemon or flag the whole 
block as bad. Since everything is replicated the block is still in the storage 
cluster so no harm is done.
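
As an illustration of what I imagine scrubbing does, here is a small sketch
that compares checksums of the replicas and repairs the odd one out. The
majority vote and the use of SHA-256 are my own assumptions for the example;
I do not know how Ceph actually decides which copy is the good one.

```python
import hashlib
from collections import Counter


def scrub(replicas: dict) -> list:
    """Compare replicas (OSD id -> bytes) and repair mismatches in place.

    Returns the ids of the OSDs whose copy had to be replaced.
    """
    digests = {osd: hashlib.sha256(data).hexdigest()
               for osd, data in replicas.items()}
    # Assume the most common digest belongs to the correct copy.
    good_digest, _ = Counter(digests.values()).most_common(1)[0]
    good_copy = next(data for osd, data in replicas.items()
                     if digests[osd] == good_digest)
    bad = [osd for osd, digest in digests.items() if digest != good_digest]
    for osd in bad:
        replicas[osd] = good_copy
    return bad


copies = {0: b"data", 1: b"dXta", 2: b"data"}
print(scrub(copies))
```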

In a normal/single file system I truly see the value of journaling and the
potential of btrfs (although it's still very slow). However, in a system like
Ceph, journaling seems to me more like a paranoid super fail-safe.

Has anyone experimented with file systems with journaling disabled, and how did
they perform?

Regards,
Johannes