I think you are missing the distinction between metadata journaling and data
journaling. In most cases a journaling filesystem is one that journals its own
metadata, but your data is on its own. Consider the case where you have a
replication level of two, the OSD filesystems have journaling disabled, and you
append a block to a file (which is an object in terms of ceph), but only one
OSD commits the change in file size to disk. Later you scrub and discover a
discrepancy in object sizes; with a replication level of 2 there is no way to
authoritatively say which copy is correct just based on what's in ceph. This is
similar to a btrfs bug that caused me to lose data with ceph.
Journaling your metadata is the absolute minimum level of assurance you need to
make a transactional system like ceph work.
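
To make that concrete, here is a rough sketch in plain Python (hypothetical sizes, not Ceph code) of why a scrub cannot break a two-way tie by voting, while three copies do give you a majority:

    from collections import Counter

    def pick_authoritative(replica_sizes):
        """Return the majority object size, or None when no copy can be trusted over the others."""
        size, votes = Counter(replica_sizes).most_common(1)[0]
        # A strict majority is needed before one copy can be called authoritative.
        return size if votes > len(replica_sizes) // 2 else None

    # Replication level 2: one OSD committed the appended block, the other did not.
    print(pick_authoritative([4096, 8192]))        # None -- nothing in ceph says which is right
    # Replication level 3: two copies agree, so the odd one out can be repaired.
    print(pick_authoritative([8192, 8192, 4096]))  # 8192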

Hey Mike :)

I get your point. However, isn't it then possible to authoritatively say which
copy is the correct one in the case of 3 OSDs?
Or is the replication level a configuration setting that tells the cluster that
the object needs to be replicated 3 times?
In both cases, data scrubbing could choose the majority among the identical
replicated objects in order to know which one is authoritative.

But I also believe (!) that each object has a checksum, and each PG too, so
it should be easy to find the corrupted object on any of the OSDs.
How else would scrubbing find corrupted sectors? Especially when I think about
2 TB SATA disks being hit by cosmic rays that flip a bit somewhere.
It happens more often with big, cheap multi-terabyte disks, but that doesn't
mean the corrupted sector is a bad sector (as in no longer usable). Journaling
is not going to help anyone with this.
Therefore I believe (again) that the data scrubber must have a mechanism to
detect these types of corruption even in a 2-OSD setup by means of checksums
(or better, a hashed checksum ID).
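
As a rough sketch of what I mean (plain Python, with a hypothetical digest stored next to the object; I do not know if this is how ceph actually lays out its metadata): a checksum taken at write time makes a flipped bit detectable on a single OSD, without consulting any other replica:

    import hashlib

    def object_digest(data: bytes) -> str:
        """Checksum computed when the object is written and stored alongside it."""
        return hashlib.sha256(data).hexdigest()

    def scrub_copy(on_disk: bytes, stored_digest: str) -> bool:
        """Re-hash the on-disk data and compare it with the digest recorded at write time."""
        return object_digest(on_disk) == stored_digest

    original = b"object payload as written by the client"
    digest = object_digest(original)

    # A cosmic-ray style single-bit flip in the stored payload:
    corrupted = bytearray(original)
    corrupted[5] ^= 0x01

    print(scrub_copy(original, digest))          # True  -- this copy is clean
    print(scrub_copy(bytes(corrupted), digest))  # False -- this replica is the bad one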

Also, aren't there two types of transactions: one for writing and one for
replicating?

On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek
<johannes.klarenb...@rigo.nl> wrote:

Dear ceph-users,

I read a lot of documentation today about ceph architecture and Linux file
system benchmarks in particular, and I could not help noticing something that I
would like to clear up for myself. Take into account that it has been a while
since I actually touched Linux, but I did some programming on php2b12 and
Apache back in the day, so I'm not a complete newbie. The real question is
below if you do not like reading the rest ;)

What I have come to understand about file systems for OSDs is that, in theory,
btrfs is the file system of choice. However, due to its young age it is not
considered stable yet. Therefore ext4, but preferably XFS, is used in most cases.
It seems that most people choose these systems because of their journaling
feature, and XFS in particular for its extended attribute storage, which has a
64 KB per-attribute limit that should be sufficient for most operations.
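
As an aside, a small sketch of what that attribute storage looks like from user space (plain Python on Linux; the file name and attribute key are made up, and the real keys and sizes ceph uses will differ):

    import os

    # Hypothetical file standing in for one object stored by an OSD (illustration only).
    path = "example_object.bin"
    with open(path, "wb") as f:
        f.write(b"object data")

    # Per-object metadata kept as an extended attribute (user.* namespace here).
    os.setxattr(path, "user.example_meta", b"some per-object metadata")
    print(os.getxattr(path, "user.example_meta"))
    os.remove(path)

    # ext4 limits a single xattr value to roughly one block (~4 KiB), while XFS
    # allows values up to 64 KiB, which is one reason XFS is preferred for OSDs.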

But when you look at file system benchmarks, btrfs is really, really slow. Then
comes XFS, then ext4, but ext2 really dwarfs all other throughput results. On
journaling file systems (like XFS, ext4 and btrfs), disabling journaling
actually helps throughput as well, sometimes by more than a factor of two for
write operations.

The preferred configuration for OSDs is one OSD per disk. Each object is
striped among the Object Storage Daemons in a cluster. So if I were to take one
disk from the cluster and check its data, chances are slim that I would find a
complete object there (a non-striped, full object, I mean).
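
A simplified sketch of that striping idea (plain Python; fixed-size objects, no CRUSH, all numbers made up) showing how one logical file ends up spread over several objects, each of which is then placed on its own set of OSDs:

    OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB per object, just an assumed default

    def locate(offset: int):
        """Map a byte offset in the logical file to (object index, offset inside that object)."""
        return offset // OBJECT_SIZE, offset % OBJECT_SIZE

    # A 10 MiB file is spread over three objects; CRUSH then places each object
    # on different OSDs, so a single disk rarely holds the whole file.
    for off in (0, 5 * 1024 * 1024, 9 * 1024 * 1024):
        obj, inner = locate(off)
        print(f"byte {off} -> object {obj}, offset {inner}")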

When a client issues an object write (I assume a full object/file write in this
case), it is the client's responsibility to stripe it among the object storage
daemons. When a stripe is successfully stored by a daemon, an ACK signal is
sent to (?) the client and all participating OSDs. When all participating
OSDs for the object have completed, the client assumes all is well and returns
control to the application.
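
A simplified sketch of that acknowledgement idea (plain Python, hypothetical classes, certainly not the real ceph messaging code): the write is only acknowledged once every participating copy has committed, and a failed node means no ACK, so the write has to be retried:

    def replicated_write(primary, replicas, name, data) -> bool:
        """Return True (ACK the client) only when every copy has committed."""
        acks = [primary.commit(name, data)]
        acks += [osd.commit(name, data) for osd in replicas]
        return all(acks)

    class FakeOSD:
        """Stand-in for an OSD; commit() returns False to simulate a failing node."""
        def __init__(self, healthy=True):
            self.healthy = healthy
            self.store = {}
        def commit(self, name, data):
            if self.healthy:
                self.store[name] = data
            return self.healthy

    # All copies commit -> the client gets its ACK and returns control to the application.
    print(replicated_write(FakeOSD(), [FakeOSD(), FakeOSD()], "obj1", b"stripe"))
    # One node fails mid-write -> no ACK, and the write must be resent elsewhere.
    print(replicated_write(FakeOSD(), [FakeOSD(healthy=False), FakeOSD()], "obj1", b"stripe"))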

If I'm not mistaken, journaling is meant for the rare occasions when a hardware
failure occurs and data is corrupted. Ceph handles this too, in another way of
course. But ceph should be able to notice whether a block/stripe is correct or
not. In the rare case that a node fails while doing a write, no ACK signal is
sent to the caller, and therefore the client can resend the block/stripe to
another OSD. Therefore I fail to see the purpose of this extra journaling
feature.

Also, ceph schedules a data scrubbing process every day (or however it is
configured) that should be able to detect bad sectors or other errors on the
file system and accordingly repair them on the same daemon, or flag the whole
block as bad. Since everything is replicated, the block is still in the storage
cluster, so no harm is done.
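
A rough sketch of that repair idea (plain Python, dictionaries standing in for the OSDs' object stores; not the actual scrub implementation): a copy that fails comparison is rewritten from a copy that is still good, so nothing is lost as long as one intact replica remains:

    import hashlib

    def scrub_and_repair(replicas, expected_digest):
        """replicas: dict of OSD name -> object bytes. Repair damaged copies in place."""
        good = {osd: data for osd, data in replicas.items()
                if hashlib.sha256(data).hexdigest() == expected_digest}
        if not good:
            return False                      # every copy is damaged; nothing to repair from
        reference = next(iter(good.values()))
        for osd in replicas:
            if osd not in good:
                replicas[osd] = reference     # rewrite the bad copy from a healthy replica
        return True

    payload = b"replicated object data"
    digest = hashlib.sha256(payload).hexdigest()
    copies = {"osd.0": payload, "osd.1": b"replicated object dat\x00"}  # osd.1 holds a corrupted copy
    print(scrub_and_repair(copies, digest), copies["osd.1"] == payload)  # True True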

In a normal/single file system I truly see the value of journaling, and the
potential of btrfs (although it is still very slow). However, in a system like
ceph, journaling seems to me more like a paranoid super fail-safe.

Has anyone experimented with file systems with journaling disabled, and how did
it perform?

Regards,
Johannes





