Thomas Gummerer <t.gumme...@gmail.com> writes: > On 01/24, Junio C Hamano wrote: >> To put it differently, if a blob stored at path has CRLF line >> endings and .gitattributes is changed after the fact to say that it >> must have LF line endings, we should treat it as a broken transitory >> state. > > Right, I wasn't considering this as a broken state, because t0025 uses > just this to transition between the states. > >> There may have to be a way to "fix" an already "wrong" blob >> in the index that is milder than "rm --cached && add .", but I do >> not think write_entry(), which is shared by all the normal codepaths >> that writes out to the working tree, is the best place to do so, if >> doing so forces the contents of the paths to be always re-checked, >> just in case the user is in such a broken transitory state. > > Maybe I'm misunderstanding something, but the contents of the paths > are only re-checked if we are in such a broken transition state, and
What I do not understand here is how the added check ensures that "only if in such a broken transition state". would_convert_to_git() does not take the contents but is called only with the pathname to key into the attributes, so in a typical cross platform project where all the source files are "text" and the repository can set core.eol to adjust the end of line convention for its working tree, the added check has no way to differentiate the paths that are recorded with CRLF line endings in the object database by mistake, i.e. the ones in the broken transitory state, and all the other paths that are following the "text" and storing their blobs with LF line endings. The added check would trigger "is it racy" check to re-reads the contents that we have written out from the working tree for the paths with "wrong" blobs, but how would it avoid doing so for the paths whose blobs are already stored correctly? The new code affects not just "reset --hard", but everybody who writes out from the object database to the working tree and records that these paths are checked out in the index. How does the new code avoid destroying the performance for all paths? > the file stored in git has crlf line endings, and thus would be > normalized. In this case we currently re-check the contents of that > file anyway, at least when the file and the index have the same mtime, > and we actually show the correct output. The right way to at that "correct output", I think, is that it happens to be shown that way by accident. It is questionable that it is correct to report that such a path is modified. Immediately after you check out a path to the working tree out of the index, via "reset --hard" and "checkout $path", by definition it ought to be the same between the working tree file and the index. Unless it is in this transititory broken state, that is. The "by accident" happens only because racy-git avoidance is being implemented in one particular way. Is about protecting people from making changes to the working tree files immediately after their previous contents are registered to the index (and the index files written to the disk), and immediately after we write things out of the index and by definition the result and the indexed blob ought to match there is no reason to re-read and recheck unless the working tree files are further edited. The current way the racy-git avoidance works by re-reading and re-checking the contents when the timestamps match is merely one possible implementation, and that is the only thing that produces your "correct" output most of the time in this test. We could have waited after writing the index time for a second before giving the control back to the user instead, which would have also allowed us to implement the racy-git avoidance, and in that alternate world, all these transitory broken paths would have been correctly reported as unmodified. IOW, I would think the test in question is insisting a single outcome for an operation whose result is undefined, and it is harmful to twist the system by pessimizing the common cases just to cater to this transititory broken state. I am not saying that we shouldn't have support for users to fix their repository and get out of this transititory broken state. A recent work by Torsten Bögershausen to have ls-files report the end of line convention used in the blob in the index and the settings that affect conversion for each path (among other things) is a step in the right direction. With a support like that, those who noticed that they by mistake added CRLF files to the index as-is when they wanted their project to be cross platform can recover from it by setting necessary attributes (i.e. mark them as "text") and then find paths that are broken out of "ls-files --eol" output to see which ones are not using lf end-of-line in the index. I do not think there is a canned command to help dealing with these broken paths right now. You would have to check them out of the index (you would get a CRLF file in the working tree in the example we are discussing), fix the line endings (you would run dos2unix on it in this example, as you would want "text=true" attribute) and "git add" them to recover manually, but I can imagine that Torsten's work can be extended to do all of these, without molesting the working tree files, with minimum work by the end user. That is: * Reuse Torsten's "ls-files --eol" code to find paths that record the blob in the index that does not follow the eol convention specified for the path; * For each of these index entries, run convert_to_working_tree() on the indexed contents, and then on the result of it, run convert_to_git(). The result is the blob that the index ought to have had, if it were to be consistent with the attribute settings. So add that to the index. * Write the index out. * Tell the user to commit or commit it automatically with a canned log message "fix broken encoding" or something. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html