This code is still in trunk without any of the discussed improvements, so this change is currently part of 1.10.0-alpha1.
If we don't implement the improvements, I think we should check whether we
want to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.

See discussion below.

    Bert

On Thu, Sep 8, 2016 at 5:42 PM, Ivan Zhakov <i...@visualsvn.com> wrote:

> On 5 September 2016 at 19:23, Ivan Zhakov <i...@visualsvn.com> wrote:
> > On 5 September 2016 at 14:46, Bert Huijben <b...@qqmail.nl> wrote:
> >>> -----Original Message-----
> >>> From: i...@apache.org [mailto:i...@apache.org]
> >>> Sent: Monday, 5 September 2016 13:33
> >>> To: comm...@subversion.apache.org
> >>> Subject: svn commit: r1759233 -
> >>> /subversion/trunk/subversion/libsvn_wc/questions.c
> >>>
> >>> Author: ivan
> >>> Date: Mon Sep 5 11:32:54 2016
> >>> New Revision: 1759233
> >>>
> >>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> >>> Log:
> >>> Use SHA-1 checksum to find whether files are actually modified in
> >>> working copy if timestamps don't match.
> >>>
> >>> Before this change we were doing this:
> >>> 1. Compare file timestamps: if they match, assume that files didn't
> >>>    change.
> >>> 2. Open pristine file.
> >>> 3. Read properties from wc.db and find whether translation is required.
> >>> 4. Compare filesize with pristine filesize for files that do not
> >>>    require translation. Assume that file is modified if the sizes
> >>>    differ.
> >>> 5. Compare detranslated contents of working file with pristine.
> >>>
> >>> Now behavior is the following:
> >>> 1. Compare file timestamps: if they match, assume that files didn't
> >>>    change.
> >>> 2. Read properties from wc.db and find whether translation is required.
> >>> 3. Compare filesize with pristine filesize for files that do not
> >>>    require translation. Assume that file is modified if the sizes
> >>>    differ.
> >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
> >>>    and compare it with pristine's checksum stored in wc.db.
> >>
> > Hi Bert,
> >
> >> We looked at this before, and this change has pros and cons,
> >> depending on specific use cases.
> >>
> > Thanks for bringing this to the dev@ list; I was not aware that this
> > topic was discussed before.
> >
> [...]
>
> >> If the file happens to be a database file or something similar
> >> there is quite commonly a change in the first 'block', when
> >> there are changes somewhere later on. (Checksum, change
> >> counter, etc.). File formats like sqlite were explicitly designed
> >> for this (and other cheap checks), with a change counter at the start.
> >
> >> I don't think we should 'just change behavior' here, if we don't
> >> have actual usage numbers for our users. Perhaps we should make
> >> this feature configurable... or depending on filesize.
> >>
> >
> > Let me summarize all possible cases that I considered before my
> > change. First of all some definitions:
> > * Text file (T) -- text file that requires translation, due to eol
> >   style or keywords expansion
> > * Text file (N) -- text file that doesn't require translation
> > * Binary file -- some kind of binary file (database, pdf, zip, docx).
> >   Let's assume that the user doesn't configure svn:eol-style and
> >   svn:keywords for them.
> > * WS -- size of working file
> > * PS -- size of pristine file
> >
> > * Old=xxx -- average required read size for old behavior in terms of
> >   working and pristine file sizes
> > * New=xxx -- average required read size for new behavior in terms of
> >   working and pristine file sizes
> >
> > 1. Text file (T), not modified: Old = WS + PS, New = WS
> > 2. Text file (N), not modified: Old = WS + PS, New = WS
> > 3. Binary file, not modified: Old = WS + PS, New = WS
> > 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS,
> >    New = WS
> > 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS,
> >    New = WS
> > 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
> > 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS,
> >    New = WS
> > 8. Text file (N), modified, different size: Old = 0, New = 0
> > 9. Binary file, modified, different size: Old = 0, New = 0
> >
> Hi Bert,
>
> I tested several different binary file formats for a no-op/minimal change:
> 1. Microsoft Word (docx): change single character at the end of document:
>    - filesize changes (case 9)
>    - first change at offset 2,295 of 233,323
> 2. Microsoft Word (doc): change single character at the end of document:
>    - filesize didn't change (case 6)
>    - first change at offset 540 of 479,232
> 3. sqlite database: insert one row to wc.db (2.5 MB)
>    - filesize didn't change (case 6)
>    - first change at offset 27
> 4. zip archive: change single character in one of many text files
>    (43 MB uncompressed)
>    - filesize changes (case 9)
>    - first change at offset 7,182,933 of 10,352,080
> 5. pdf file: no-op change of 800 KB file
>    - filesize changes (case 9)
>    - first change at offset 47 of 854,971
> 6. Photoshop image (psd): change one pixel in the middle
>    - filesize changes (case 9)
>    - first change at offset 32 of 69,615,507
>
> With this in mind, I think we can improve the current approach so that
> it would be better in all possible cases. We could do this:
> 1. Open pristine file
> 2. Read first 4-16 KB from pristine and normalized working file
> 3. Compare them: if they are equal then close pristine file, calculate
>    SHA1 of normalized working file and compare it with checksum in wc.db.
>
> This behavior would apply only to files larger than some threshold
> (e.g. 1 MB), to make the performance penalty for opening the pristine
> file negligible.
>
> What do you think?
>
> --
> Ivan Zhakov
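
For reference, the combined check proposed in the quoted mail could be sketched roughly as follows. This is an illustrative Python sketch only, not the libsvn_wc code: the function names, the 16 KB prefix size and the 1 MB threshold are assumptions picked from the ranges mentioned in the mail. The timestamp check is left out, the working file is assumed to need no translation, and the pristine size is read from the pristine file here rather than from wc.db.

```python
import hashlib
import os

# Assumed values, taken from the ranges mentioned in the mail.
PREFIX_SIZE = 16 * 1024        # upper end of the "4-16 KB" prefix
SIZE_THRESHOLD = 1024 * 1024   # the "e.g. 1 MB" threshold


def sha1_of_file(path, chunk_size=64 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(working_path, pristine_path, pristine_sha1):
    """Hypothetical helper: True if the working file differs from pristine.

    Simplification: no eol-style/keyword translation is applied, so the
    size comparison is valid for every file passed in.
    """
    working_size = os.path.getsize(working_path)

    # Different sizes: modified, no file contents need to be read.
    if working_size != os.path.getsize(pristine_path):
        return True

    # Large files only: compare the first block of both files. Many binary
    # formats change near the start, so this often ends the check early.
    if working_size > SIZE_THRESHOLD:
        with open(working_path, "rb") as w, open(pristine_path, "rb") as p:
            if w.read(PREFIX_SIZE) != p.read(PREFIX_SIZE):
                return True

    # Prefixes equal (or file is small): hash the working file and compare
    # with the stored pristine checksum, without reading the pristine file.
    return sha1_of_file(working_path) != pristine_sha1
```

With this shape, an unmodified large file costs about WS plus one small prefix read of the pristine, while a file whose first block changed (the doc, sqlite, pdf and psd cases measured above) is detected after reading only the two prefixes.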