The zero-width no-break space character (U+FEFF) is used as the Byte
Order Mark.  That is, its official function at the beginning of a
character sequence is to indicate whether you have a 2-byte or 4-byte,
big-endian or little-endian encoding.  It was not intended for use in
UTF-8, where there is nothing for it to tell you, but Microsoft jumped
in with all six feet and said "hey, we'll use this to indicate that
it's Unicode in UTF-8 and not one of the hundreds of other 8-bit coded
character sets."  I've lost count of the number of programs that have
choked because they were given a BOM where they didn't expect one.
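
For anyone who wants to see the character for themselves, here is a
small sketch in R.  It assumes the string is written with a \u escape
(which R stores as UTF-8); the variable names are mine, not from the
original data.

```r
# U+FEFF written with a \u escape; R stores such strings as UTF-8.
bom <- "\ufeff"

# In UTF-8 the BOM occupies three bytes.
bom_bytes <- charToRaw(bom)
print(bom_bytes)

# A number with a BOM glued to the front is one character longer than
# it looks, and as.numeric() cannot parse it.
x <- paste0(bom, "123")
print(nchar(x))
print(suppressWarnings(as.numeric(x)))
```

The invisible character counts towards nchar(), which is often the
quickest way to spot it.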

So there is no great mystery about why there is a BOM at the beginning
of this particular string.  The real mystery is why it was there and
NOT at the beginning of all the others.

I suggest that it is a good idea to remove the BOM character from the
beginning of microsofted strings, but a bad idea to remove any other
character.  If you are given bad data like "Bond-007" when you expect
a number, you want to know about it, and not mistake it for -007.
Still less do you want a phone number like "+61 3 555 1234 x77" to be
mistaken for a plain number "613555123477".
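
A minimal sketch of the difference, assuming UTF-8 strings.  The helper
strip_bom() and the sample vector are hypothetical, made up for
illustration; the gsub() pattern is the one Ivan posted.

```r
# Hypothetical helper: drop a single U+FEFF at the very start of each
# string and leave every other character alone.
strip_bom <- function(x) sub("^\ufeff", "", x)

x <- c("\ufeff123", "456", "Bond-007", "+61 3 555 1234 x77")

# Only the BOM goes, so the bad entries still come back as NA.
careful <- suppressWarnings(as.numeric(strip_bom(x)))
print(careful)

# Removing everything except digits, dots and hyphens turns the bad
# entries into plausible-looking numbers instead.
greedy <- as.numeric(gsub("[^0-9.-]+", "", x))
print(greedy)
```

With the first approach the NAs tell you which inputs need a human to
look at them; with the second nothing warns you at all.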

On Tue, 4 Mar 2025 at 06:24, <avi.e.gr...@gmail.com> wrote:
>
> The second solution Ivan offers looks good, and a bit more general than his 
> first that simply removes one non-visible character.
>
> It raises the question of why the data has that anomaly at all. Did the
> data come from a text-processing environment where it was going to wrap
> there and was protected?
>
> As Ivan points out, there is a question of what format you expect numbers in 
> and what "as.numeric"  should do when it does not see an integer or floating 
> point number.
>
> If you test it, you can see that as.numeric ignores leading and/or trailing
> blanks and tabs and even newlines sometimes and some other irrelevant ASCII
> characters. In that spirit, the Unicode character being mentioned should be
> one that any Unicode-aware version of as.numeric should ignore.
>
> But Unicode supports a much wider vision of numeric, so that there are
> numeric-equivalent symbols in other languages and groupings, and even
> something like the symbols for numerals in light or dark circles counts as
> a number. Those can likely safely be excluded in this context but perhaps
> not in a more general function.
>
> But I note as.numeric seems to handle scientific notation as in:
>
> as.numeric("1.23e8")
> [1] 1.23e+08
>
> So a single instance of the letters "e" and "E" must be supported if your 
> numbers in string form may contain them. Further, the E cannot be the first 
> or last letter. It cannot have adjacent whitespace. Still, if you are OK with 
> getting an NA in such situations, it should be OK.
>
> It gets worse. Hexadecimal is supported:
>
> > as.numeric("0X12")
> [1] 18
>
> You now need to support the letters x and X. But only if preceded by a zero!
>
> It gets still worse as any characters from [0-9A-Fa-f] are supported:
>
> > as.numeric("0xAE")
> [1] 174
>
> There may be other scenarios it handles. The filter applied might remove 
> valid numbers so you may want to carefully document it if your program only 
> handles a restricted set.
>
> A possible idea might be to make two passes: run as.numeric() first, then
> apply a substitution like the one Ivan suggests only to the elements that
> came back NA, to try to fix the broken ones. But note it may fix too much,
> as "1.2 e 5" might become "1.2e5" once spaces are removed.
>
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov via 
> R-help
> Sent: Monday, March 3, 2025 3:09 AM
> To: Christofer Bogaso <bogaso.christo...@gmail.com>
> Cc: r-help <r-help@r-project.org>
> Subject: Re: [R] Failed to convert data to numeric
>
> On Mon, 3 Mar 2025 13:21:31 +0530
> Christofer Bogaso <bogaso.christo...@gmail.com> wrote:
>
> > Is there any way to remove all possible "Unicode character" that may
> > be present in the array at once?
>
> Define a range of characters you consider acceptable, and you'll be
> able to use regular expressions to remove everything else. For example,
> the following expression should remove everything except ASCII digits,
> dots, and hyphen-minus:
>
> gsub('[^0-9.-]+', '', dat2)
>
> There is a brief introduction to regular expressions in ?regex and
> various online resources such as <https://regex101.com/>.
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.