On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <ava...@gmail.com> writes:
>
>> FWIW what I meant was not that we'd run around and iconv() things, it
>> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
>> valid", which presumably would be the end result of something like that.
>>
>> Rather that this model of assuming that a UTF-8 pattern means we can
>> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
>> kwset patches *revealed* that problem in a painful way, but it was there
>> already.
>
> We already do assume that pathnames are UTF-8 (pathspecs on MacOS
> are converted and then they are matched assuming that property).
> Further, with the same mechanism, I think there is an assumption
> that anything that comes from the command line is UTF-8 (and if I
> recall correctly, doesn't the Windows port of Git force us to use
> the same assumption---I recall we needed to tweak tests for that).
>
> In the much longer term, I do not think we would want to keep
> the assumption that the text encoding of blobs is always UTF-8, and
> it would be nice to extend the system, so that blob data could be
> marked in some way to say "I'm in Big-5, and not in UTF-8, so please
> treat me as such" and magically the needle and the haystack can be
> made to agree, with iconv() either one of them.
>
> But I do not think the current topic to fix the immediate/imminent
> breakage should be distracted by that.  Let's keep assuming that
> any blob, when it is text, is UTF-8.
>
> And from that point of view, I think the two ideas in your
> earlier message do make sense.  We can try to match as binary most
> of the time, as UTF-8 would not let a valid UTF-8 needle match in
> the haystack starting in the middle of a character.

*nod*
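
To make the property above concrete, here is a minimal Python sketch (not
git's code; `find_all` and `is_char_boundary` are hypothetical helpers made
up for illustration). It searches byte-wise, with no decoding at all, and
then checks that every hit for a valid UTF-8 needle in a valid UTF-8
haystack starts on a character boundary:

```python
def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every byte offset at which needle occurs in haystack."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


def is_char_boundary(data: bytes, offset: int) -> bool:
    """True unless data[offset] is a UTF-8 continuation byte (0b10xxxxxx)."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


haystack = "Ævar grep ær æ".encode("utf-8")
needle = "æ".encode("utf-8")

# Plain byte search, no UTF-8 validation of the haystack.
hits = find_all(haystack, needle)
print(hits)  # [11, 15]
print(all(is_char_boundary(haystack, h) for h in hits))  # True

# The same search on bytes that are not valid UTF-8 simply runs.
print(find_all(b"\xff\xc3\xa6", needle))  # [1]
```

Since the search never decodes anything, invalid bytes elsewhere in the
haystack cannot make it fail.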

> When the user is trying to match case-insensitively, we know the
> haystack in which the user is interested in finding the needle is
> text, even though there may be non-text blobs as well.
>
> For example, "git grep -i 'foo' t/" may find a few png files under
> the t/ directory.  We do not care if they happen to contain Foo and
> we do not mind if they appear or do not appear in the result.  The
> only two things we care about are (1) foo, Foo, FOO are found in the
> text files under t/ and (2) the command does not die in the middle,
> before processing all the files, only because a png file it found
> was not UTF-8 valid.

I think this part's a step too far, and not how e.g. GNU grep
works. Peeking into binary data in a text grep is what people expect,
e.g. because you might want to recursively grep a mix of text files and
mp3s for an author. The mp3 metadata is stored as text, so it will be
matched inside the binary files.
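
As a rough illustration of that expectation, here is a Python sketch that is
neither what git nor what GNU grep does internally. The `grep_i` helper and
the fake blob are assumptions made up for this example. It decodes with the
`surrogateescape` error handler so that bytes which are not valid UTF-8 never
raise, then does a Unicode-aware case-insensitive search over the result:

```python
import re


def grep_i(data: bytes, needle: str) -> bool:
    """Case-insensitively search for needle in data without ever raising
    on bytes that are not valid UTF-8 (hypothetical helper)."""
    # surrogateescape maps each undecodable byte to a lone surrogate
    # code point instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", errors="surrogateescape")
    return re.search(re.escape(needle), text, re.IGNORECASE) is not None


# Made-up blob, not a real mp3: invalid UTF-8 bytes surrounding a
# UTF-8 encoded author tag.
blob = b"ID3\x03\x00\xff\xfb\x90TPE1\xc3\x86var\x00\xff\xfe"

print(grep_i(blob, "ævar"))  # True
print(grep_i(blob, "foo"))   # False
```

This only covers the easy case; it says nothing about text embedded in other
encodings, which is where the edges get hard.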

Getting that right is hard around the edges though...
