This issue looks like grist for the R Inferno.

cheers,

Rolf


On Mon, 3 Mar 2025 12:19:02 -0500
<avi.e.gr...@gmail.com> wrote:

> The second solution Ivan offers looks good, and a bit more general
> than his first that simply removes one non-visible character.
> 
> It raises the question of why the data has that anomaly at all. Did
> the data come from a text-processing environment where the text would
> otherwise have wrapped at that point and was protected against it?
> 
> As Ivan points out, there is a question of what format you expect
> numbers in and what "as.numeric" should do when it does not see an
> integer or floating point number.
> 
> If you test it, you can see that as.numeric ignores leading and/or
> trailing blanks, tabs, newlines and other ASCII whitespace
> characters. In that spirit, the Unicode character being mentioned
> should be one that any Unicode-aware version of as.numeric should
> ignore.
> 
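A quick check of the whitespace behaviour described above, as a minimal sketch. The sample strings are made up for illustration and are not from the original data:

```r
# Leading and trailing ASCII whitespace is skipped during conversion.
padded <- c(" 1.5", "1.5 ", "\t1.5\n")
vals <- as.numeric(padded)
print(vals)

# Whitespace inside the number is not skipped: the result is NA
# (with a coercion warning, suppressed here).
inner <- suppressWarnings(as.numeric("1 .5"))
print(inner)
```
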
> But Unicode supports a much wider notion of numeric: there are
> numeric-equivalent symbols in other languages and groupings, and even
> things like the symbols for numerals in light or dark circles count
> as numbers. Those can likely safely be excluded in this context but
> perhaps not in a more general function.
> 
> But I note as.numeric seems to handle scientific notation as in:
> 
> as.numeric("1.23e8")
> [1] 1.23e+08
> 
> So a single instance of the letter "e" or "E" must be allowed if
> your numbers in string form may contain one. Further, the E cannot
> be the first or last character, and it cannot have adjacent
> whitespace. Still, if you are OK with getting an NA in such
> situations, it should be OK.
> 
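A small sketch of the scientific-notation cases discussed above. The input strings are illustrative assumptions, and only cases whose outcome is not in doubt are shown:

```r
# Well-formed scientific notation, with either "e" or "E" and an optional sign.
sci_ok <- as.numeric(c("1.23e8", "1.23E8", "1e-3"))
print(sci_ok)

# An "e" with nothing numeric before it, or with whitespace around it,
# does not convert; both give NA (warnings suppressed here).
sci_bad <- suppressWarnings(as.numeric(c("e5", "1.2 e 5")))
print(sci_bad)
```
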
> It gets worse. Hexadecimal is supported:
> 
> > as.numeric("0X12")
> [1] 18
> 
> You now need to support the letters x and X, but only if preceded by
> a zero!
> 
> It gets still worse, as any hexadecimal digits from [0-9A-Fa-f] are
> supported:
> 
> > as.numeric("0xAE")
> [1] 174
> 
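The hexadecimal cases can be checked together in one short sketch. The strings below are illustrative, not taken from the poster's data:

```r
# A "0x" or "0X" prefix switches to hexadecimal; digits may be upper or lower case.
hex_ok <- as.numeric(c("0X12", "0xAE", "0xae"))
print(hex_ok)

# Without the leading zero, or without the prefix at all, conversion fails.
hex_bad <- suppressWarnings(as.numeric(c("x12", "AE")))
print(hex_bad)
```
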
> There may be other scenarios it handles. The filter applied might
> remove valid numbers so you may want to carefully document it if your
> program only handles a restricted set.
> 
> A possible idea might be to make two passes: run as.numeric() first,
> and only for elements that come back NA do a substitution like Ivan
> suggests to try to fix the broken ones. But note it may fix too much,
> as "1.2 e 5" might become "1.2e5" when spaces are removed.
> 
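The two-pass idea above might look roughly like this. It is only a sketch: the helper name as_numeric_two_pass and the sample vector are hypothetical, U+200B (zero-width space) is an assumed stand-in for whatever invisible character is in the real data, and the default pattern is the one from Ivan's message:

```r
# Hypothetical helper: convert first, then clean and retry only the failures.
as_numeric_two_pass <- function(x, pattern = "[^0-9.-]+") {
  out <- suppressWarnings(as.numeric(x))    # pass 1: plain conversion
  retry <- is.na(out) & !is.na(x)           # failed, but not missing on input
  cleaned <- gsub(pattern, "", x[retry])    # filter applied to failures only
  out[retry] <- suppressWarnings(as.numeric(cleaned))  # pass 2
  out
}

# Illustrative input: valid scientific and hex forms survive pass 1 untouched.
x <- c("1.23e8", "0xAE", "12\u200b34", "1.2 e 5", "abc")
res <- as_numeric_two_pass(x)
print(res)
```

With this particular pattern the "e" is stripped along with the spaces, so "1.2 e 5" is turned into "1.25" rather than into NA or 1.2e5. That is one more way the second pass can fix too much, and a reason to document what the filter accepts.
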
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov
> via R-help
> Sent: Monday, March 3, 2025 3:09 AM
> To: Christofer Bogaso <bogaso.christo...@gmail.com>
> Cc: r-help <r-help@r-project.org>
> Subject: Re: [R] Failed to convert data to numeric
> 
> On Mon, 3 Mar 2025 13:21:31 +0530
> Christofer Bogaso <bogaso.christo...@gmail.com> wrote:
> 
> > Is there any way to remove all possible "Unicode character" that may
> > be present in the array at once?
> 
> Define a range of characters you consider acceptable, and you'll be
> able to use regular expressions to remove everything else. For
> example, the following expression should remove everything except
> ASCII digits, dots, and hyphen-minus:
> 
> gsub('[^0-9.-]+', '', dat2)
> 
> There is a brief introduction to regular expressions in ?regex and
> various online resources such as <https://regex101.com/>.
> 
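For reference, here is what that filter does on its own to a few sample strings. This is a sketch under assumptions: dat2 below is a made-up stand-in for the poster's data, and U+200B (zero-width space) is only an assumed example of an invisible character:

```r
# Hypothetical stand-in for the poster's dat2.
dat2 <- c("1\u200b234.5", " -42 ", "1.23e8", "0xAE")

# Keep only ASCII digits, dots and hyphen-minus; drop everything else.
cleaned <- gsub("[^0-9.-]+", "", dat2)
print(cleaned)

nums <- as.numeric(cleaned)
print(nums)
```

The first two strings come out as intended, but "1.23e8" is reduced to "1.238" and "0xAE" to "0", so the filter silently changes values written in scientific or hexadecimal notation.
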



-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Stats. Dep't. (secretaries) phone:
         +64-9-373-7599 ext. 89622
Home phone: +64-9-480-4619

