This is not for the R inferno. This is for the Microsoft inferno, or perhaps the Unicode inferno. The Byte Order Mark (U+FEFF, ZERO WIDTH NO-BREAK SPACE, or ZWNBSP) is supposed to appear at the beginning of UTF-32 or UTF-16 *external* data, like a file or data coming over a socket. In the Microsoft world, it also tends to appear at the beginning of UTF-8 files, where, strictly speaking, it shouldn't. ONLY at the beginning does ZWNBSP have this function.
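For what it's worth, here is a minimal sketch in R of the "only at the beginning" rule. The helper name strip_leading_bom is made up for illustration; it is not part of base R or of anything in this thread. It drops U+FEFF only when it is the first character of the first string and leaves any later occurrence alone. The second half relies on the encoding name "UTF-8-BOM", which ?file documents as accepted for reading and as removing a BOM if one is present. The temporary file is an assumed stand-in for real data.

```r
# Hypothetical helper (not part of base R): remove U+FEFF only when it is
# the first character of the first element, where it acts as a BOM.
strip_leading_bom <- function(x) {
  if (length(x) > 0L) {
    x[1L] <- sub("^\ufeff", "", x[1L])
  }
  x
}

fields <- c("\ufeff12.5", "7")   # first field carries a leading BOM
clean <- as.numeric(strip_leading_bom(fields))

# Alternative: let the connection skip the BOM while reading.
# Build a small UTF-8 file whose first three bytes are the BOM (EF BB BF).
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("12.5\n7\n")), tmp)
from_file <- as.numeric(readLines(file(tmp, encoding = "UTF-8-BOM")))
unlink(tmp)
```

An interior U+FEFF, as in c("1", "\ufeff2"), is deliberately left untouched, since there it is not acting as a BOM and silently dropping it could hide a real data problem.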
I use a lot of programming languages, and I don't know any that routinely ignores ZWNBSP.

Hmm. I wonder if the strings in this example are fields of a data file but were originally in a different order, with the last string first?

What *would* make sense would be an option, when opening a connection, to skip a leading BOM.

On Tue, 4 Mar 2025 at 10:45, Rolf Turner <rolftur...@posteo.net> wrote:
>
>
> This issue looks like grist for the R Inferno.
>
> cheers,
>
> Rolf
>
>
> On Mon, 3 Mar 2025 12:19:02 -0500
> <avi.e.gr...@gmail.com> wrote:
>
> > The second solution Ivan offers looks good, and a bit more general
> > than his first that simply removes one non-visible character.
> >
> > It begs the question of why the data has that anomaly at all. Did the
> > data come from a text-processing environment where it was going to
> > wrap there and was protected?
> >
> > As Ivan points out, there is a question of what format you expect
> > numbers in and what "as.numeric" should do when it does not see an
> > integer or floating point number.
> >
> > If you test it, you can see that as.numeric ignores leading and/or
> > trailing blanks and tabs and even newlines sometimes and some other
> > irrelevant ASCII characters. In that spirit, the UNICODE character
> > being mentioned should be one that any UNICODE-aware version of
> > as.numeric should ignore.
> >
> > But UNICODE supports a much wider vision of numeric so that there are
> > numeric-equivalent symbols in other languages and groupings and even
> > something like the symbols for numerals in light or dark circles
> > count as numbers. Those can likely safely be excluded in this context
> > but perhaps not in a more general function.
> >
> > But I note as.numeric seems to handle scientific notation as in:
> >
> > as.numeric("1.23e8")
> > [1] 1.23e+08
> >
> > So a single instance of the letters "e" and "E" must be supported if
> > your numbers in string form may contain them. Further, the E cannot
> > be the first or last letter. It cannot have adjacent whitespace.
> > Still, if you are OK with getting an NA in such situations, it should
> > be OK.
> >
> > It gets worse. Hexadecimal is supported:
> >
> > > as.numeric("0X12")
> > [1] 18
> >
> > You now need to support the letters x and X. But only if preceded by
> > a zero!
> >
> > It gets still worse as any characters from [0-9A-F] are supported:
> >
> > > as.numeric("0xAE")
> > [1] 174
> >
> > There may be other scenarios it handles. The filter applied might
> > remove valid numbers so you may want to carefully document it if your
> > program only handles a restricted set.
> >
> > A possible idea might be to make two passes and only evaluate any
> > resulting NA from as.numeric() by doing a substitution like Ivan
> > suggests to try to fix any broken ones. But note it may fix too much
> > as "1.2 e 5" might become "1.2e5" as spaces are removed.
> >
> > -----Original Message-----
> > From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov
> > via R-help Sent: Monday, March 3, 2025 3:09 AM
> > To: Christofer Bogaso <bogaso.christo...@gmail.com>
> > Cc: r-help <r-help@r-project.org>
> > Subject: Re: [R] Failed to convert data to numeric
> >
> > On Mon, 3 Mar 2025 13:21:31 +0530
> > Christofer Bogaso <bogaso.christo...@gmail.com> wrote:
> >
> > > Is there any way to remove all possible "Unicode character" that may
> > > be present in the array at once?
> >
> > Define a range of characters you consider acceptable, and you'll be
> > able to use regular expressions to remove everything else. For
> > example, the following expression should remove everything except
> > ASCII digits, dots, and hyphen-minus:
> >
> > gsub('[^0-9.-]+', '', dat2)
> >
> > There is a brief introduction to regular expressions in ?regex and
> > various online resources such as <https://regex101.com/>.
> >
>
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Stats. Dep't. (secretaries) phone:
> +64-9-373-7599 ext. 89622
> Home phone: +64-9-480-4619
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.