The second solution Ivan offers looks good, and is a bit more general than his 
first, which simply removes one non-visible character.

It raises the question of why the data has that anomaly at all. Did the data 
come from a text-processing environment where the text was going to wrap at 
that point and was protected against it?

As Ivan points out, there is a question of what format you expect numbers in 
and what as.numeric() should do when it does not see an integer or 
floating-point number.

If you test it, you can see that as.numeric ignores leading and trailing 
blanks, tabs, and even newlines, along with some other ASCII whitespace 
characters. In that spirit, the Unicode character being mentioned should be one 
that any Unicode-aware version of as.numeric would ignore.
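
A minimal sketch of that kind of test, using only ASCII whitespace. The sample 
strings are made up for illustration and are not from the original data:

```r
# Leading and trailing ASCII whitespace is skipped by as.numeric()
padded <- as.numeric(c("  42\t", "\n3.5 "))

# Whitespace in the middle of the digits is not skipped and gives NA
# (warning suppressed to keep the output short)
inner <- suppressWarnings(as.numeric("4 2"))

print(padded)
print(inner)
```

Whether a given non-ASCII space is treated the same way is a separate matter 
that this sketch does not try to settle.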

But Unicode supports a much wider notion of numeric: there are 
numeric-equivalent symbols in other scripts, digit groupings, and even symbols 
such as numerals in light or dark circles that count as numbers. Those can 
likely be excluded safely in this context, but perhaps not in a more general 
function.

But I note as.numeric seems to handle scientific notation as in:

> as.numeric("1.23e8")
[1] 1.23e+08

So a single instance of the letter "e" or "E" must be supported if your 
numbers in string form may contain one. Further, the E cannot be the first or 
last character, and it cannot have adjacent whitespace. Still, if you are OK 
with getting an NA in such situations, it should be OK.
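
As a quick illustration of those rules, here is a small sketch. The inputs are 
invented examples, and only the cases I am sure about are shown:

```r
# Either case of the exponent marker is accepted
ok <- as.numeric(c("1.23e8", "1.23E8", "1.23e-2"))

# Whitespace next to the exponent marker, or a missing mantissa, gives NA
# (warnings suppressed to keep the output short)
bad <- suppressWarnings(as.numeric(c("1.2 e5", "1.2e 5", "e5")))

print(ok)
print(bad)
```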

It gets worse. Hexadecimal is supported:

> as.numeric("0X12")
[1] 18

You now need to support the letters x and X, but only if preceded by a zero!

It gets still worse, as any hexadecimal digit from [0-9A-Fa-f] is then supported:

> as.numeric("0xAE")
[1] 174

There may be other scenarios it handles. The filter applied might remove valid 
numbers, so you may want to document it carefully if your program only handles 
a restricted set.
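
To make the risk concrete, here is a small sketch comparing direct conversion 
with conversion after the digits/dots/hyphen filter quoted below. The helper 
name keep_basic and the sample values are mine, purely for illustration:

```r
# Hypothetical helper wrapping the filter: keep only ASCII digits, dots
# and hyphen-minus
keep_basic <- function(x) gsub('[^0-9.-]+', '', x)

x <- c("1.23e8", "0xAE", "-12.5")

direct   <- as.numeric(x)              # 1.23e8, 174, -12.5
filtered <- as.numeric(keep_basic(x))  # 1.238, 0, -12.5

print(data.frame(x, direct, filtered))
```

The first two values come back as perfectly ordinary numbers after filtering, 
just not the right ones, so nothing signals that anything went wrong.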

A possible idea might be to make two passes: call as.numeric() first, and only 
for entries that come back as NA do a substitution like Ivan suggests to try to 
fix the broken ones. But note it may fix too much: with Ivan's exact pattern 
"1.2 e 5" becomes "1.25", and with a pattern extended to keep "e" it would 
become "1.2e5" as the spaces are removed.
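
A rough sketch of the two-pass idea, not a tested solution for the original 
data. The function name to_numeric_two_pass is made up, and the no-break space 
U+00A0 is only an assumed stand-in for whatever stray character is really 
present:

```r
# Hypothetical helper: convert everything first, then clean and retry
# only the entries that failed
to_numeric_two_pass <- function(x) {
  out <- suppressWarnings(as.numeric(x))
  failed <- is.na(out) & !is.na(x)   # leave genuine NA inputs alone
  cleaned <- gsub('[^0-9.-]+', '', x[failed])
  out[failed] <- suppressWarnings(as.numeric(cleaned))
  out
}

# "\u00a0" is an assumed stand-in for the stray character in the data
x <- c("1.23e8", "0xAE", "12.5\u00a0", "1.2 e 5")
res <- to_numeric_two_pass(x)
print(res)
```

The first two entries never reach the filter, so scientific notation and 
hexadecimal survive. The last entry shows the over-fixing problem: it comes 
back as 1.25 rather than NA.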

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov via R-help
Sent: Monday, March 3, 2025 3:09 AM
To: Christofer Bogaso <bogaso.christo...@gmail.com>
Cc: r-help <r-help@r-project.org>
Subject: Re: [R] Failed to convert data to numeric

On Mon, 3 Mar 2025 13:21:31 +0530
Christofer Bogaso <bogaso.christo...@gmail.com> wrote:

> Is there any way to remove all possible "Unicode character" that may
> be present in the array at once?

Define a range of characters you consider acceptable, and you'll be
able to use regular expressions to remove everything else. For example,
the following expression should remove everything except ASCII digits,
dots, and hyphen-minus:

gsub('[^0-9.-]+', '', dat2)

There is a brief introduction to regular expressions in ?regex and
various online resources such as <https://regex101.com/>.

-- 
Best regards,
Ivan

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
