On Fri, 20 Mar 2020 13:46:25 +0100 Adam Borowski via Unicode <unicode@unicode.org> wrote:
> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a
> > property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format. Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system. Digital's old RMS
formats included, as basic text files, ones in which each record
(roughly a line) started with a binary 2-byte length field. Text
records on magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable. Indeed, are
you not aware of the concept of a write-only programming language?

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
>   not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file. Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint? This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary.
Indeed, as Word files naturally include pictures, a Word document very
much isn't a text file. A .doc file is more like an image dump of a
file system. A .rtf file, on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.

Richard.
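
As a rough illustration only, the three-part test quoted above could be
sketched in Python along the following lines. Several choices here are
assumptions rather than anything stated in the thread: the function
names are made up, the expected charset is taken to be UTF-8, CR is
allowed as part of a CRLF newline, and "invalid characters" is read as
"Unicode noncharacters", so that unassigned code points still pass.

```python
# Hypothetical helper names; a sketch of the quoted three-part test,
# not a definitive definition of "text".

# Assumption: tab, LF and CR (as part of CRLF) are the only C0 controls allowed.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def looks_like_text(data: bytes) -> bool:
    # 1. No bytes 0..31 other than newlines and tabs (form feed is rejected).
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return False
    # 2. Correctly encoded for the expected charset (assumed to be UTF-8).
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # 3. No invalid characters, read here as noncharacters only;
    #    unassigned code points are deliberately not rejected.
    return not any(is_noncharacter(ord(ch)) for ch in text)


if __name__ == "__main__":
    print(looks_like_text(b"%!PS-Adobe-3.0\n"))        # ASCII PostScript header
    print(looks_like_text("hi".encode("utf-16-le")))   # contains NUL bytes
    print(looks_like_text(b"PK\x03\x04"))              # ZIP (.docx) signature
```

Under these assumptions, a PostScript header counts as text, while
UTF-16LE and the start of a ZIP container do not, because both contain
bytes below 0x20.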