Thanks to the ubiquity of Excel and its misguided inclusion of a byte order mark (BOM) in its UTF-8 CSV format, this optimism about encoding being a corner case seems premature. Excel actually offers multiple options for writing CSV files, and only one of them (fortunately not the first one) has this "feature", but I (and the various beginners I end up helping) seem to encounter these silly files far more often than seems reasonable.
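For anyone hitting these files from R, here is a minimal sketch of one workaround, assuming the file really is UTF-8 with a leading BOM. ?file documents a "UTF-8-BOM" encoding for reading that removes the mark if it is present, and read.csv() passes its fileEncoding argument on to the connection. The temporary file and its contents are made up for illustration.

```r
# Build a small CSV that starts with the UTF-8 byte order mark (EF BB BF),
# imitating the kind of file described above. The contents are hypothetical.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id,name\n1,alpha\n2,beta\n"), con)
close(con)

# Declaring the file as "UTF-8-BOM" asks the connection to drop a leading
# BOM if there is one, so the first column name is not contaminated by it.
clean <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(clean))

unlink(path)
```

The same encoding name can be given to file() directly, e.g. readLines(file(path, encoding = "UTF-8-BOM")), if the data are not tabular.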

On April 5, 2022 11:20:37 AM PDT, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
>On 3/28/22 13:16, Ivan Krylov wrote:
>> On Mon, 28 Mar 2022 09:54:57 +0200
>> Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>>
>>> Could you please clarify which part you found somewhat confusing,
>>> could that be improved?
>> Perhaps "somewhat confusing" is an overstatement, sorry about that. All
>> the information is already there in both ?file and ?readLines, it just
>> requires a bit of thought to understand it.
>>
>>>> When reading from a text connection, the connections code, after
>>>> re-encoding based on the ‘encoding’ argument, returns text that is
>>>> assumed to be in native encoding; an encoding mark is only added by
>>>> functions that read from the connection, so e.g. ‘readLines’ can
>>>> be instructed to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but
>>>> ‘readLines’ does no further conversion. To allow reading text in
>>>> ‘"UTF-8"’ on a system that cannot represent all such characters in
>>>> native encoding (currently only Windows), a connection can be
>>>> internally configured to return the read text in UTF-8 even though
>>>> it is not the native encoding; currently ‘readLines’ and ‘scan’ use
>>>> this feature when given a connection that is not yet open and, when
>>>> using the feature, they unconditionally mark the text as ‘"UTF-8"’.
>> The paragraph starts by telling the user that the text is decoded into
>> the native encoding, then tells about marking the encoding (which is
>> counter-productive when decoding arbitrarily-encoded text into native
>> encoding) and only then presents the exception to the native encoding
>> output rule (decoding into UTF-8). If I'm trying to read a
>> CP1252-encoded file on a Windows 7 machine with CP1251 as the session
>> encoding, I might get confused by the mention of encoding mark between
>> the parts that are important to me.
>>
>> It could be an improvement to mention that exception closer to the
>> first point of the paragraph and, perhaps, to split the "encoding mark"
>> part from the "text connection decoding" part:
>>
>>>> Functions that read from the connection can add an encoding mark
>>>> to the returned text. For example, ‘readLines’ can be instructed
>>>> to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but does no further
>>>> conversion.
>>>>
>>>> When given a connection that is not yet open and has a non-default
>>>> ‘encoding’ argument, ‘readLines’ and ‘scan’ internally configure the
>>>> connection to read text in UTF-8. Otherwise, the text after decoding
>>>> is assumed to be in native encoding.
>> (Maybe this is omitting too much and should be expanded.)
>>
>> It could also be helpful to mention the fact that the encoding argument
>> to readLines() can be ignored right in the description of that
>> argument, inviting the user to read the Details section for more
>> information.
>
>Thanks for the suggestions, I've rewritten the paragraphs, biasing
>towards users who have UTF-8 as the native encoding as this is going to
>be the majority. These users should not have to worry much about the
>encoding marks anymore, nor about the internal UTF-8 mode of the
>connections code. But the level of detail I think needs to remain as
>long as these features are supported - the level of detail is based on
>numerous questions and bug reports.
>
>Best
>Tomas
>
>______________________________________________
>R-package-devel@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-package-devel

-- 
Sent from my phone. Please excuse my brevity.