On Thu, Apr 28, 2016 at 07:49:58PM +0200, Irrwahn wrote:
> On Thu, 28 Apr 2016 13:16:53 -0400, Hendrik Boom wrote:
> > On Thu, Apr 28, 2016 at 06:53:35AM +0000, Noel Torres wrote:
> >> Hughe Chung <janpeng...@riseup.net> escribió:
> [...]
> >>> $ grep tesselate dome_math.c
> >>> Binary file dome_math.c matches
> [...]
> >> If I were to bet, I would say that the file dome_math.c is not
> >> correctly formatted, or has an incorrect BOM at start, or so.
> > 
> > I've occasionally had a program that accepted UTF-8 reject a file 
> > because it *had* a valid BOM at the start.
> [...]
> 
> That would be because the notion of a BOM makes little 
> sense for UTF-8. There is no byte order issue with 
> UTF-8, yet some brilliant mind thought it would be a good 
> idea to define and allow one (EF BB BF) anyway. And, pray 
> tell, other brilliant minds decided to use it as a way to 
> tell UTF-8 from traditional single-byte encodings. This is 
> absurd, as it is just as bad as any other heuristic one 
> may come up with to deduce a text file's character encoding. 
> 
> To add insult to injury, some poorly written text editing 
> tools insert a BOM without any need, or even being asked to, 
> deliberately breaking otherwise perfectly fine 7-bit ASCII 
> files and rendering them incompatible with legacy software. 

Don't assume that ASCII is that fine.  The majority of the world uses 
languages that don't fit in ASCII.

Back in the 90's, when I was implementing C, the C standard specified 
that a C program was made of characters, and it did *not* specify that 
those were ASCII characters.  Even with the various ISO national 
variants on ASCII, many characters were represented using multiple 
bytes.  Strings in the source code were a sequence of characters, not 
bytes, and some of those characters could be of the two-byte 
persuasion, represented at run time by a pair of bytes, of course.  
Some characters would be represented with one byte, and some with two.

Now it just happened that one of the characters in Korean was 
represented with a two-byte pair, and one of those bytes was a zero 
byte.  Such a zero byte was *not* a terminating byte; instead it 
was part of a normal character in a normal string.  If you use the 
appropriate locale-sensitive multibyte string operations, it is not 
even recognised as a string terminator.

Needless to say, I got involved in all this because I had to fix the 
bug in the C parser, which converted the string notation to 
a C string internally, and ended up chopping it off when 
this character showed up.  Which is what one of our Korean users was 
complaining about.

My point is that it would be good if there were some reliable way to 
determine the character encoding a file is written in.  I've 
standardised on UTF-8 myself.  Even UTF-8 is hated in Japan, because a 
lot of characters that used to take two bytes now take three, and 
Japanese uses a lot of these now space-wasting characters.

-- hendrik
_______________________________________________
Dng mailing list
Dng@lists.dyne.org
https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/dng
