On Tue, 03 Apr 2018, Michael Lange wrote:
> I believe (please anyone correct me if I am wrong) that "text" files
> won't contain any null byte; many text editors even refuse to open such a

Depends on the encoding. For ASCII, ISO-8859-* and UTF-8 (and any other
modern encoding AFAIK, other than modified UTF-8), any zero bytes map
one-to-one to the NUL character/code point. I don't recall how it is on
other common encodings of the '80s and '90s, though.

Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
bytes with the value of zero when encoding characters, so NUL is encoded
by a different sequence, and you can safely use a byte with the value of
zero for some out-of-band control (like zero-terminated strings that can
contain NULs, etc) -- note that NUL is a character, and it might be
represented by a sequence of bytes that has nothing to do with zeroes on
a particular encoding... (in fact, C strings are *zero-terminated*, not
NUL-terminated, but most of the time this is irrelevant :p).

Also, a text file MAY contain NULs (the character), it is just
considered bad practice (nowadays?). Don't assume you won't see any.

For example, received e-mail is *more* likely to have NULs in it than
normal text due to the quality of some mail agents out there. I recall
postfix would reject a *lot* of crap when we configured it to refuse to
accept NULs outside of 8-bit bodies, because Cyrus-IMAPd *refuses* any
such crap, and we wanted it bounced as early as possible. (note that
NULs are forbidden in MIME-compliant email text and ESMTP, unless
encoded or guarded by an 8-bit transfer area of known size, so there you
have it: NULs in one text format that actually forbids them!).

> Probably it is the same with some other control characters like 04 (End
> of Transmission). When I look at https://en.wikipedia.org/wiki/ASCII
> it seems like 1C (File Separator) or 1E (Record Separator) might be
> appropriate choices for you. I'm no expert on this, though.

Well, ASCII control characters were inherited by ISO-8859-* and Unicode,
so yes, you can use them. But so could the data file.
It would be perfectly ok for a text data file to use the record
separator control characters to delimit records in a table, for
example...

Here's a good definition of them (follow the hyperlinks for the
definition of each control character):
https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)

Here is also a proper solution: use modified UTF-8 (which encodes NUL so
that zero bytes are *never* present in the stream): encode every input
format to modified UTF-8, then add the zero-byte separators you want.

You'll have to normalize the input data set into known charset/encodings
and then recode them to modified UTF-8, of course. You can't blindly
call any random data "UTF-8" (let alone modified UTF-8) and expect
things not to break horribly.

-- 
Henrique Holschuh
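
A minimal sketch of the modified UTF-8 approach described above, in
Python. Python's standard library has no modified UTF-8 codec, so
encode_mutf8, decode_mutf8, pack and unpack are hypothetical helpers
written only for illustration. The sketch assumes every record has
already been decoded from its real charset into a str (for example with
raw.decode("iso-8859-1")), which is the normalization step mentioned
above.

```python
def encode_mutf8(text: str) -> bytes:
    """Encode text so that the result never contains a zero byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL becomes the two-byte sequence C0 80 instead of 00.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Supplementary characters are written as a surrogate pair,
            # each half encoded as a three-byte sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def decode_mutf8(data: bytes) -> str:
    """Reverse encode_mutf8; a raw zero byte is treated as an error."""
    if b"\x00" in data:
        raise ValueError("zero byte inside a modified UTF-8 record")
    # C0 is never a valid lead byte in standard UTF-8, so C0 80 can only
    # be the encoded NUL.
    text = data.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins surrogate pairs back together.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def pack(records: list[str]) -> bytes:
    """Join records with zero-byte separators (out-of-band by construction)."""
    return b"\x00".join(encode_mutf8(r) for r in records)


def unpack(blob: bytes) -> list[str]:
    """Split on the zero-byte separators and decode each record."""
    return [decode_mutf8(part) for part in blob.split(b"\x00")]


if __name__ == "__main__":
    records = ["plain text", "embedded\x00NUL", "caf\u00e9"]
    blob = pack(records)
    print(blob)
    print(unpack(blob) == records)
```

Since the encoded records cannot contain a zero byte, splitting on it is
unambiguous even when a record itself contains the NUL character.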