> > +QCOW2 can use one or more instance of a metadata journal.
>
> s/instance/instances/
>
> Is there a reason to use multiple journals rather than a single journal
> for all entry types?  The single journal area avoids seeks.

Here are the main reasons:

For deduplication, some patterns such as cycles of insertion/deletion could
leave the hash table almost empty while filling the journal.

If the journal is full and the hash table is empty, a packing operation is
started. Basically, a new journal is created and only the entries present in
the hash table are reinserted.

This is why I want to keep the deduplication journal apart from the regular
qcow2 journal: to avoid interference between a pack operation and regular
qcow2 journal entries.

The other thing is that freezing the log store would require a replay of the
regular qcow2 entries, as it triggers a reset of the journal.

Also, since deduplication will not work on spinning disks, I discarded the
seek time factor.

Maybe committing the dedupe journal in erase-block-sized chunks would be a
good idea to reduce random writes to the SSD.

An additional reason for having multiple journals is that the SILT paper
proposes a mode where a prefix of the hash is used to dispatch insertions to
multiple stores, and this is easier to do with multiple journals.

> > +
> > +A journal is a sequential log of journal entries appended on a previously
> > +allocated and reseted area.
>
> I think you say "previously reset area" instead of "reseted".  Another
> option is "initialized area".
>
> > +A journal is designed like a linked list with each entry pointing to the next
> > +so it's easy to iterate over entries.
> > +
> > +A journal uses the following constants to denote the type of each entry
> > +
> > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > +                      cluster
> > +TYPE_HASH = 2         the entry contains a deduplication hash
> > +
> > +QCOW2 journal entry:
> > +
> > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
>
> This is not clear.  I'm wondering if the +2 is included in the byte
> value or not.
> I'm also wondering what a byte value of zero means and
> what a byte value of 255 means.

I am counting the journal entry header in the size, so yes, the +2 is
included in the byte value.
A byte value of 0, 1 or 255 is an error.

Maybe this design is bogus and I should count only the payload size in the
size field. That would leave fewer tricky cases.

> Please include an example to illustrate how this field works.
>
> > +
> > +         1         :    Type of the entry
> > +
> > +    2 - size       :    The optional n bytes structure carried by entry
> > +
> > +A journal is divided into clusters and no journal entry can be spilled on two
> > +clusters. This avoid having to read more than one cluster to get a single entry.
> > +
> > +For this purpose an entry with the end type is added at the end of a journal
> > +cluster before starting to write in the next cluster.
> > +The size of such an entry is set so the entry points to the next cluster.
> > +
> > +As any journal cluster must be ended with an end entry the size of regular
> > +journal entries is limited to 254 bytes in order to always left room for an end
> > +entry which mimimal size is two bytes.
> > +
> > +The only cases where size > 254 are none entries where size = 255.
> > +
> > +The replay of a journal stop when the first end none entry is reached.
>
> s/stop/stops/
>
> > +The journal cluster size is 4096 bytes.
>
> Questions about this layout:
>
> 1. Journal entries have no integrity mechanism, which is especially
> important if they span physical sectors where cheap disks may perform
> a partial write.  This would leave a corrupt journal.  If the last
> bytes are a checksum then you can get some confidence that the entry
> was fully written and is valid.

I will add a checksum mechanism.

Do you have any preferences regarding the checksum function?

> Did I miss something?
>
> 2. Byte-granularity means that read-modify-write is necessary to append
> entries to the journal.
> Therefore a failure could destroy previously
> committed entries.

It's designed to be committed in 4KB blocks.

> Any ideas how existing journals handle this?
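
Coming back to the request for an example of the size field: below is a minimal sketch in Python of the entry layout as currently described, where the size byte counts the two header bytes. It is only an illustration, not the patch code. The names encode_entry, iter_entries, HEADER_SIZE and MAX_ENTRY_SIZE are made up for this mail, and stopping the replay at the first entry whose type byte is 0xFF is my reading of the "none entry" rule.

```python
TYPE_NONE = 0xFF  # value of every byte in a reset journal
TYPE_END = 1      # ends a journal cluster, points to the next cluster
TYPE_HASH = 2     # carries a deduplication hash

HEADER_SIZE = 2        # hypothetical name: size byte + type byte
MAX_ENTRY_SIZE = 254   # hypothetical name: limit stated in the spec text


def encode_entry(entry_type, payload=b""):
    """Build one entry; the size byte counts the 2-byte header too."""
    size = HEADER_SIZE + len(payload)
    if size > MAX_ENTRY_SIZE:
        raise ValueError("entry too large: %d bytes" % size)
    return bytes([size, entry_type]) + payload


def iter_entries(buf):
    """Yield (type, payload) pairs until a none entry or the end of buf."""
    offset = 0
    while offset + HEADER_SIZE <= len(buf):
        size, entry_type = buf[offset], buf[offset + 1]
        if entry_type == TYPE_NONE:
            return  # reset area reached (assumed stop condition)
        # Sizes 0, 1 and 255 are treated as errors, as are entries
        # that would run past the end of the buffer.
        if size < HEADER_SIZE or size > MAX_ENTRY_SIZE:
            raise ValueError("bad size byte %d at offset %d" % (size, offset))
        if offset + size > len(buf):
            raise ValueError("entry at offset %d overruns the buffer" % offset)
        if entry_type != TYPE_END:
            yield entry_type, bytes(buf[offset + HEADER_SIZE:offset + size])
        # Adding the size always lands on the next entry; for an end
        # entry that is the start of the next cluster.
        offset += size
```

With this layout a 32-byte hash is stored with a size byte of 34, and an entry without payload has a size byte of 2. If the size field counted only the payload instead, the same two entries would carry 32 and 0, and the "0 or 1 is an error" cases would go away.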