The zero-width no-break space character (U+FEFF) is used as the Byte Order Mark. That is, its official function at the beginning of a character sequence is to indicate whether you have a 2-byte or 4-byte encoding, and whether it is big-endian or little-endian. It was not intended for use in UTF-8, where there is no byte order for it to tell you about, but Microsoft jumped in with all six feet and said "hey, we'll use this to indicate that it's Unicode in UTF-8 and not one of the hundreds of other 8-bit coded character sets." I've lost count of the number of programs that have choked because they were given a BOM where they didn't expect one.
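For anyone who wants to see the thing itself, here is a small sketch in R. It assumes a made-up one-column CSV written to a temporary file; the names `path` and `dat` are only for illustration. It shows the three bytes that encode U+FEFF in UTF-8, and that asking for the "UTF-8-BOM" encoding when reading lets R discard a BOM at the start of the file.

```r
# U+FEFF written as an R string escape.
bom <- "\ufeff"

# Its UTF-8 encoding is three bytes.
bom_bytes <- charToRaw(enc2utf8(bom))
print(bom_bytes)

# Hypothetical file for illustration: a BOM followed by a tiny CSV.
path <- tempfile(fileext = ".csv")
writeBin(c(bom_bytes, charToRaw("value\n123\n")), path)

# "UTF-8-BOM" tells the connection to drop a leading BOM if one is present.
dat <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(dat))
print(dat$value)

unlink(path)
```

This only helps when the BOM sits at the very start of a file; a BOM in the middle of the data has to be dealt with after reading.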
So there is no great mystery about why there is a BOM at the beginning of this particular string. The real mystery is why it was there and NOT at the beginning of all the others.

I suggest that it is a good idea to remove the BOM character from the beginning of microsofted strings, but a bad idea to remove any other character. If you are given bad data like "Bond-007" when you expect a number, you want to know about it, and not mistake it for -007. Still less do you want a phone number like "+61 3 555 1234 x77" to be mistaken for a plain number "613555123477".

On Tue, 4 Mar 2025 at 06:24, <avi.e.gr...@gmail.com> wrote:
>
> The second solution Ivan offers looks good, and a bit more general than his
> first that simply removes one non-visible character.
>
> It begs the question of why the data has that anomaly at all. Did the data
> come from a text-processing environment where it was going to wrap there and
> was protected?
>
> As Ivan points out, there is a question of what format you expect numbers in
> and what "as.numeric" should do when it does not see an integer or floating
> point number.
>
> If you test it, you can see that as.numeric ignores leading and/or trailing
> blanks and tabs and even newlines sometimes and some other irrelevant ASCII
> characters. In that spirit, the UNICODE character being mentioned should be
> one that any UNICODE-aware version of as.numeric should ignore.
>
> But UNICODE supports a much wider vision of numeric so that there are
> numeric-equivalent symbols in other languages and groupings and even
> something like the symbols for numerals in light or dark circles count as
> numbers. Those can likely safely be excluded in this context but perhaps not
> in a more general function.
>
> But I note as.numeric seems to handle scientific notation as in:
>
> as.numeric("1.23e8")
> [1] 1.23e+08
>
> So a single instance of the letters "e" and "E" must be supported if your
> numbers in string form may contain them. Further, the E cannot be the first
> or last letter. It cannot have adjacent whitespace. Still, if you are OK with
> getting an NA in such situations, it should be OK.
>
> It gets worse. Hexadecimal is supported:
>
> > as.numeric("0X12")
> [1] 18
>
> You now need to support the letters x and X. But only if preceded by a zero!
>
> It gets still worse as any characters from [0-9A-F] are supported:
>
> > as.numeric("0xAE")
> [1] 174
>
> There may be other scenarios it handles. The filter applied might remove
> valid numbers so you may want to carefully document it if your program only
> handles a restricted set.
>
> A possible idea might be to make two passes and only evaluate any resulting
> NA from as.numeric() by doing a substitution like Ivan suggests to try to fix
> any broken ones. But note it may fix too much as "1.2 e 5" might become
> "1.2e5" as spaces are removed.
>
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov via
> R-help
> Sent: Monday, March 3, 2025 3:09 AM
> To: Christofer Bogaso <bogaso.christo...@gmail.com>
> Cc: r-help <r-help@r-project.org>
> Subject: Re: [R] Failed to convert data to numeric
>
> On Mon, 3 Mar 2025 13:21:31 +0530
> Christofer Bogaso <bogaso.christo...@gmail.com> wrote:
>
> > Is there any way to remove all possible "Unicode character" that may
> > be present in the array at once?
>
> Define a range of characters you consider acceptable, and you'll be
> able to use regular expressions to remove everything else. For example,
> the following expression should remove everything except ASCII digits,
> dots, and hyphen-minus:
>
> gsub('[^0-9.-]+', '', dat2)
>
> There is a brief introduction to regular expressions in ?regex and
> various online resources such as <https://regex101.com/>.
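
To make the difference between the two approaches concrete, here is a rough sketch. The helper name `strip_leading_bom` and the sample vector are mine, invented for illustration (the original poster's `dat2` is not shown in this thread). The strict version removes only a BOM at the start of each string and lets as.numeric() reject everything else; the permissive version is the character-class filter quoted above.

```r
# Hypothetical helper: drop a single U+FEFF at the very start of each string.
strip_leading_bom <- function(x) sub("^\ufeff", "", x)

# Made-up sample data: one BOM-prefixed number, two bad values,
# and two forms that as.numeric() accepts as they stand.
dat2 <- c("\ufeff12.5", "Bond-007", "+61 3 555 1234 x77", "1.23e8", "0xAE")

# Strict: only the BOM goes, so bad data still surfaces as NA.
strict <- suppressWarnings(as.numeric(strip_leading_bom(dat2)))

# Permissive: keep nothing but ASCII digits, dots, and hyphen-minus.
loose <- suppressWarnings(as.numeric(gsub("[^0-9.-]+", "", dat2)))

print(data.frame(input = dat2, strict = strict, loose = loose))
```

With this sample the strict column has NA for "Bond-007" and the phone number, while the permissive column turns them into -7 and 613555123477, and also damages the two valid forms: "1.23e8" becomes 1.238 and "0xAE" becomes 0.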
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.