Re: file obstruction upon merging an already merged added/moved file (#4649)

2016-09-05 Thread Stefan Hett

On 8/22/2016 5:51 PM, Stefan Hett wrote:

Hi,

as per stsp's suggestion, sending this to the dev list.

I've reduced a problem we've occasionally been having with merge 
operations on our main repository down to the repro script attached 
to this JIRA issue [1] (test_obstruction.bat).


Instead of the successful merge I'd expect in this case, I get 
the following error when running svn merge [URL] -cX:

"local file obstruction, incoming file add upon merge"

The test case performs the following steps:
1. create test file and commit
2. modify the file twice and commit each modification separately
3. rename the file and commit
4. merge changes into the branch and commit
5. modify the file on trunk and commit
6. merge the branch back into trunk cherry picking the revision which 
was created in step 4


At this point svn merge produces the error mentioned above.

To me this is unexpected, since I would expect the merge to succeed. 
In the end, all I'm doing is merging the same file back into trunk 
(even though the file was modified in trunk).
This looks especially inconsistent to me, since if I run either of 
the following merges instead of the cherry-picking merge, the merge 
succeeds without producing any file conflicts:

- "svn merge [URL]" or
- "svn merge [URL]@X+1"

[1] https://issues.apache.org/jira/browse/SVN-4649

Can anybody confirm that the behavior I see is unexpected and considered 
a bug? Or is it by design?


--
Regards,
Stefan Hett



Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

2016-09-05 Thread Bert Huijben


> -Original Message-
> From: i...@apache.org [mailto:i...@apache.org]
> Sent: maandag 5 september 2016 13:33
> To: comm...@subversion.apache.org
> Subject: svn commit: r1759233 -
> /subversion/trunk/subversion/libsvn_wc/questions.c
> 
> Author: ivan
> Date: Mon Sep  5 11:32:54 2016
> New Revision: 1759233
> 
> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> Log:
> Use SHA-1 checksum to find whether files are actually modified in working
> copy if timestamps don't match.
> 
> Before this change we were doing this:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 2. Open pristine file.
> 3. Read properties from wc.db and find whether translation is required.
> 4. Compare filesize with pristine filesize for files that do not
>require translation. Assume that file is modified if the sizes differ.
> 5. Compare detranslated contents of working file with pristine.
> 
> Now behavior is the following:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 2. Read properties from wc.db and find whether translation is required.
> 3. Compare filesize with pristine filesize for files that do not
>require translation. Assume that file is modified if the sizes differ.
> 4. Calculate SHA-1 checksum of detranslated contents of working file
>and compare it with pristine's checksum stored in wc.db.
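Sketched in Python for illustration (libsvn_wc's real implementation is in C; the function and parameter names below are made up, not Subversion's API), the quoted check sequence is roughly:

```python
import hashlib
import os

def is_modified(working_path, recorded_mtime, recorded_size,
                pristine_sha1, needs_translation):
    """Rough model of the new check order (hypothetical API)."""
    st = os.stat(working_path)
    # 1. Timestamp check: if it matches the recorded value, assume unchanged.
    if int(st.st_mtime) == recorded_mtime:
        return False
    # 2./3. If no eol/keyword translation is needed, a size mismatch against
    #       the recorded pristine size proves the file was modified.
    if not needs_translation and st.st_size != recorded_size:
        return True
    # 4. Hash the (detranslated) working contents and compare with the SHA-1
    #    recorded in wc.db -- one full read of the working file, but the
    #    pristine file itself is never opened.
    h = hashlib.sha1()
    with open(working_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest() != pristine_sha1
```

The key structural change is in step 4: the old code opened and read the pristine file, while the new code only consults the checksum already stored in wc.db.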

We looked at this before, and this change has pros and cons, depending on 
the specific use case.

With the SHA comparison we only have to read the working file, but we always 
have to read it in full.

With the older system we could bail out at the first detected difference.

If there is a change somewhere, both systems read on average 100% of the 
file size... only when there is no actual change except for the timestamp is 
the new system less expensive.
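The trade-off can be made concrete with a toy measurement (Python sketch; the two helpers are hypothetical illustrations, not Subversion code): a streaming byte-compare may stop at the first differing block, while a checksum pass must always consume the whole file.

```python
import hashlib

def bytes_read_compare(a, b, block=4096):
    """Byte-compare two in-memory 'files', stopping at the first differing
    block. Returns how many bytes of `a` were read."""
    read = 0
    for off in range(0, max(len(a), len(b)), block):
        chunk_a, chunk_b = a[off:off + block], b[off:off + block]
        read += len(chunk_a)
        if chunk_a != chunk_b:
            break          # early exit: the old behavior's advantage
    return read

def bytes_read_hash(a, block=4096):
    """Hash one 'file' in full; always reads everything."""
    h = hashlib.sha1()
    read = 0
    for off in range(0, len(a), block):
        h.update(a[off:off + block])
        read += len(a[off:off + block])
    return read

working = b'\x01' + b'\x00' * (1024 * 1024 - 1)   # differs in the first byte
pristine = b'\x00' * (1024 * 1024)
```

For a file whose first block already differs (Bert's database-file case below), the compare touches a single 4k block while the hash reads the full megabyte.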


If the file happens to be a database file or something similar, then when 
there are changes somewhere later on there is quite commonly also a change in 
the first 'block' (checksum, change counter, etc.). File formats like SQLite's 
were explicitly designed for this (and other cheap checks), with a change 
counter at the start.


I don't think we should 'just change behavior' here when we don't have actual 
usage numbers from our users. Perhaps we should make this feature 
configurable... or make it depend on file size.



We certainly want the new behavior for non-pristine working copies (on the IDEA 
list for years), but I'm not sure we always want this behavior as the only 
option.



This mail is partly just to discuss this topic on the list, and to make sure 
everybody knows what happened here and why.



Bert

(Note that it is Labor Day in the USA today... so I don't expect many responses 
until later this week.)



svn does not react to Ctrl-C when a background process has an unredirected stdout

2016-09-05 Thread Vincent Lefevre
I've finally found a way to reproduce the following old bug:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=50

(it still occurs with svn 1.9.4).

Make SVN_SSH point to the following script:


#!/bin/sh

sleep 10 &

while true; do :; done


Do an "svn up" and press Ctrl-C. Then svn terminates only when "sleep"
terminates. If I redirect the output of "sleep" to /dev/null,
like this:


#!/bin/sh

sleep 10 > /dev/null &

while true; do :; done


then the problem disappears, i.e. Ctrl-C immediately terminates svn.
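The mechanism is pipe inheritance rather than anything svn-specific: svn waits for EOF on the tunnel command's output, and a background child that inherited the write end of the pipe keeps it open after the script itself exits. A Python sketch of the same effect (assumes a POSIX sh and sleep are available; this is an illustration, not svn's actual code path):

```python
import subprocess
import time

# The shell exits immediately, but the backgrounded `sleep` inherits the
# write end of our stdout pipe, so read() only sees EOF when sleep dies.
start = time.monotonic()
proc = subprocess.Popen(['sh', '-c', 'sleep 2 & exit 0'],
                        stdout=subprocess.PIPE)
proc.stdout.read()          # blocks ~2s despite the shell exiting at once
blocked = time.monotonic() - start
proc.wait()

# Redirecting the background job's stdout (as in the /dev/null variant
# above) means our pipe is closed as soon as the shell exits:
start = time.monotonic()
proc = subprocess.Popen(['sh', '-c', 'sleep 2 >/dev/null & exit 0'],
                        stdout=subprocess.PIPE)
proc.stdout.read()          # returns almost immediately
unblocked = time.monotonic() - start
proc.wait()
```

This matches the observed behavior: redirecting the background process's output removes the extra writer on the pipe, so the read (and hence svn's shutdown on Ctrl-C) is no longer delayed.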

Note: The reason this problem occurs in my case is that my script
does "ssh -fMN ..." (which corresponds to the "sleep ... &" here)
when the control socket doesn't exist. The infinite loop corresponds
to the real ssh that would freeze due to a network problem.

-- 
Vincent Lefèvre
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


RE: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

2016-09-05 Thread Markus Schaber
Hi,

From: Bert Huijben [mailto:b...@qqmail.nl]
> From: i...@apache.org [mailto:i...@apache.org]
> > Sent: maandag 5 september 2016 13:33
> > To: comm...@subversion.apache.org
> > Subject: svn commit: r1759233 -
> > /subversion/trunk/subversion/libsvn_wc/questions.c
> >
> > Author: ivan
> > Date: Mon Sep  5 11:32:54 2016
> > New Revision: 1759233
> >
> > URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> > Log:
> > Use SHA-1 checksum to find whether files are actually modified in
> > working copy if timestamps don't match.
> >
> > Before this change we were doing this:
> > 1. Compare file timestamps: if they match, assume that files didn't change.
> > 2. Open pristine file.
> > 3. Read properties from wc.db and find whether translation is required.
> > 4. Compare filesize with pristine filesize for files that do not
> >require translation. Assume that file is modified if the sizes differ.
> > 5. Compare detranslated contents of working file with pristine.
> >
> > Now behavior is the following:
> > 1. Compare file timestamps: if they match, assume that files didn't change.
> > 2. Read properties from wc.db and find whether translation is required.
> > 3. Compare filesize with pristine filesize for files that do not
> >require translation. Assume that file is modified if the sizes differ.
> > 4. Calculate SHA-1 checksum of detranslated contents of working file
> >and compare it with pristine's checksum stored in wc.db.
> 
> We looked at this before, and this change has pro-s and con-s, depending on
> specific use cases.
> 
> With the compare to SHA we only have to read the new file, but we always have
> to read the file 100%.
> 
> With the older system we could bail on the first detected change.
> 
> If there is a change somewhere both systems read on average 100% of the
> filesize... only if there is no actual change except for the timestamp, the
> new system is less expensive.
> 
> If the file happens to be a database file or something similar there is quite
> commonly a change in the first 'block', when there are changes somewhere
> later on. (Checksum, change counter, etc.). File formats like sqlite were
> explicitly designed for this (and other cheap checks), with a change counter
> at the start.

Maybe we could cache a checksum of the first block (4k) of each file in 
wc.db, to benefit from those cases. Then we would only need to read the whole 
file if the first-block checksums are equal.
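Markus's two-checksum idea could look roughly like this (a sketch assuming wc.db grew a second, first-4k checksum column; all names here are made up for illustration):

```python
import hashlib

BLOCK = 4096  # the "first block" size from the suggestion

def record_checksums(pristine_data):
    """Hypothetical pair of values stored per pristine file in wc.db:
    a checksum of the first 4k plus the existing full-file checksum."""
    return (hashlib.sha1(pristine_data[:BLOCK]).hexdigest(),
            hashlib.sha1(pristine_data).hexdigest())

def two_tier_is_modified(working_data, head_sha1, full_sha1):
    """Try the cheap first-block checksum before hashing the whole file."""
    if hashlib.sha1(working_data[:BLOCK]).hexdigest() != head_sha1:
        return True   # header already differs: only 4k had to be read
    # First block matches: fall back to the full-file checksum.
    return hashlib.sha1(working_data).hexdigest() != full_sha1
```

For a database-style file whose change counter lives in the header, the first branch fires after reading only 4k; the full read is needed only when the first blocks agree.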

> I don't think we should 'just change behavior' here, if we don't have actual
> usage numbers for our users. Perhaps we should make this feature
> configurable... or depending on filesize.
> 
> We certainly want the new behavior for non-pristine working copies (on the
> IDEA list for years), but I'm not sure if we always want this behavior as
> only option.
> 
> This mail is partially, to just discuss this topic on the list, to make sure
> everybody knows what happened here and why.
> 
> 
> 
>   Bert
> 
> (Note that it is labor day in the USA today... so I don't expect many
> responses until later this week)


Best regards

Markus Schaber



Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

2016-09-05 Thread Ivan Zhakov
On 5 September 2016 at 14:46, Bert Huijben  wrote:
>> -Original Message-
>> From: i...@apache.org [mailto:i...@apache.org]
>> Sent: maandag 5 september 2016 13:33
>> To: comm...@subversion.apache.org
>> Subject: svn commit: r1759233 -
>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>
>> Author: ivan
>> Date: Mon Sep  5 11:32:54 2016
>> New Revision: 1759233
>>
>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>> Log:
>> Use SHA-1 checksum to find whether files are actually modified in working
>> copy if timestamps don't match.
>>
>> Before this change we were doing this:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 2. Open pristine file.
>> 3. Read properties from wc.db and find whether translation is required.
>> 4. Compare filesize with pristine filesize for files that do not
>>require translation. Assume that file is modified if the sizes differ.
>> 5. Compare detranslated contents of working file with pristine.
>>
>> Now behavior is the following:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 2. Read properties from wc.db and find whether translation is required.
>> 3. Compare filesize with pristine filesize for files that do not
>>require translation. Assume that file is modified if the sizes differ.
>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>and compare it with pristine's checksum stored in wc.db.
>
Hi Bert,

> We looked at this before, and this change has pro-s and con-s, depending on 
> specific use cases.
>
Thanks for bringing this to the dev@ list; I was not aware that this topic
had been discussed before.

> With the compare to SHA we only have to read the new file, but we
> always have to read the file 100%.
>
> With the older system we could bail on the first detected change.
>
I considered this trade-off. See below.

> If there is a change somewhere both systems read on average
> 100% of the filesize... only if there is no actual change except
> for the timestamp, the new system is less expensive.
>
As far as I understand, the average characteristics are:
1. Files are equal:
   a) old behavior: 100% read of working file + 100% read of pristine
file = 200% of working file size.
   b) new behavior: 100% read of working file = 100% of working file size.

2. Files modified (but have the same size or require translation (!)):
   a) old behavior: 50% (average) read of working file + 50% (average)
read of pristine file = 100% of working file size.
   b) new behavior: 100% read of working file = 100% of working file size.

(Strictly speaking, average read size would also depend on the number
of modifications, and it could be less than 50%.)

Also, libsvn_wc checks the working file size for files that don't require
translation before comparing contents. Keyword expansion/newline
translation doesn't make sense for binary files (like databases, PDF,
DOCX), and for most binary file formats a modification involves a change
in size.

(There was a problem in the old behavior because the pristine file was
opened *before* comparing the working file size. Fixing that would
require an additional SQLite operation.)

> If the file happens to be a database file or something similar
> there is quite commonly a change in the first 'block', when
> there are changes somewhere later on. (Checksum, change
> counter, etc.). File formats like sqlite were explicitly designed
> for this (and other cheap checks), with a change counter at the start.

> I don't think we should 'just change behavior' here, if we don't
> have actual usage numbers for our users. Perhaps we should make
> this feature configurable... or depending on filesize.
>

Let me summarize all the cases that I considered before my change. First,
some definitions:
* Text file (T) -- text file that requires translation, due to eol style
or keyword expansion
* Text file (N) -- text file that doesn't require translation
* Binary file -- some kind of binary file (database, pdf, zip, docx).
Let's assume that users don't configure svn:eol-style and svn:keywords
for them.
* WS -- size of working file
* PS -- size of pristine file

* Old=xxx -- average required read size for old behavior in terms of
working and pristine file sizes
* New=xxx -- average required read size for new behavior in terms of
working and pristine file sizes

1. Text file (T), not modified:  Old = WS + PS, New = WS
2. Text file (N), not modified:  Old = WS + PS, New = WS
3. Binary file, not modified:  Old = WS + PS, New = WS
4. Text file (T), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
5. Text file (N), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
6. Binary file, modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
7. Text file (T), modified, different size:  Old = 0.5 * WS + 0.5 * PS, New = WS
8. Text file (N), modified, different size:  Old = 0, New = 0
9. Binary file, modified, different size:  Old = 0, New = 0
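The nine cases condense into a small cost model (Python sketch; WS/PS follow the definitions above, the 0.5 factors encode the averaging assumption, and the size shortcut applies only to files needing no translation):

```python
def avg_read_cost(behavior, modified, same_size, needs_translation, ws, ps):
    """Average bytes read to answer 'is this file modified?'.

    behavior: 'old' (byte-compare against pristine) or 'new' (SHA-1 vs wc.db).
    """
    # Cases 8/9: for untranslated files, the size check alone settles it.
    if modified and not same_size and not needs_translation:
        return 0
    if behavior == 'new':
        return ws                    # one full read of the working file
    if modified:
        return 0.5 * ws + 0.5 * ps   # bail out halfway through, on average
    return ws + ps                   # unmodified: both files read in full
```

Under this model the new behavior wins in cases 1-3 (200% of the file size drops to 100%) and only loses, by the averaging assumption, in cases 4-7.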

(There is some overhead for SH

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

2016-09-05 Thread Stefan Hett

On 9/5/2016 6:23 PM, Ivan Zhakov wrote:

> With all above the new behavior should be working better or the same
> in all cases. I agree that 50% approximation may be incorrect for some
> specific binary formats (case 6) like sqlite db.

To be fair, I'd argue that in the case of binary file modifications the 
approximation is quite far off. Most binary formats (if not all) in our 
repository differ in the first couple of bytes (if they were changed), so 
it makes quite a significant difference whether we read the full contents 
of a single file (which might be >100MB) or just the first few bytes of 
two files.


As Bert already suggested, I fully support the point that it's quite a 
common design pattern for binary formats to have some checksum, timestamp, 
counter value, file-size record, etc. at the beginning of the file 
contents, which is likely to differ if the file has changed. If you then 
take the typical size difference between text files and binary files into 
account (i.e. text files usually being quite small, while binary files are 
usually quite large), the expected difference in the binary-file 
comparison case can certainly matter a great deal.


FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k 
block and another for the full file) therefore sounds like a reasonable 
suggestion.


Last but not least, in practice the throughput of calculating the SHA-1 is 
also limited by I/O throughput. For working copies I'd assume it's not too 
unlikely that they still reside on an HDD (rather than some faster cache 
or an SSD), so it'd be limited to around 20 MB/s in practice. Given large 
binary files, this might make a significant difference in certain (not 
uncommon) use cases.


Don't get this wrong: I agree that the SHA-1 approach is superior 
(especially on Windows machines, since it reduces the cases where two 
files have to be opened -- pointer: anti-virus scanner impact). I just 
share Bert's opinion that the approach should be improved a bit, 
especially in light of binary file support.


If it would be of any help, I could do some performance measurements with 
the two approaches on our repository, to get some real-world numbers to 
work with.


--
Regards,
Stefan Hett



Re: file obstruction upon merging an already merged added/moved file (#4649)

2016-09-05 Thread Julian Foad



On 05/09/16 10:55, Stefan Hett wrote:

On 8/22/2016 5:51 PM, Stefan Hett wrote:

Hi,

as per stsp's suggestion, sending this to the dev list.

I've reduced a problem we've been having with some merge operations on
our main repository from time to time down to the repro script
attached to this JIRA issue [1] (test_obstruction.bat).

Instead of the successful merge I'd expect in this case, I get
the following error when running svn merge [URL] -cX:
"local file obstruction, incoming file add upon merge"

The test case performs the following steps:
1. create test file and commit
2. modify the file twice and commit each modification separately
3. rename the file and commit
4. merge changes into the branch and commit
5. modify the file on trunk and commit
6. merge the branch back into trunk cherry picking the revision which
was created in step 4

At this point svn merge produces the error mentioned above.

To me this is unexpected, since I would expect the merge to succeed.
In the end, all I'm doing is merging the same file back into trunk
(even though the file was modified in trunk).
This looks especially inconsistent to me, since if I run either of
the following merges instead of the cherry-picking merge, the merge
succeeds without producing any file conflicts:
- "svn merge [URL]" or
- "svn merge [URL]@X+1"

[1] https://issues.apache.org/jira/browse/SVN-4649


Can anybody confirm that the behavior I see is unexpected and considered
a bug? Or is it by design?


I haven't looked at the details of your script, but when you write "all 
I'm doing is merging back the same file into trunk", this suggests to me 
that you may not be thinking about merging the same way Subversion is 
designed to do it. It's designed to merge two *changes* -- and you say the 
revision you're trying to merge back is "step 4", which is a change that 
includes the *creation* of a file. It is expected that if you try to merge 
one set of changes that creates a file with another set of changes that 
also creates that file, then you get a conflict, sometimes called an "add 
versus add" conflict, which here apparently is reported as an "obstruction".


I hope that helps.

- Julian