This issue looks like grist for the R Inferno.
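
For anyone following along, here is a minimal sketch of the points made in the quoted message below: what as.numeric() already accepts, how Ivan's filter mangles some of those inputs when applied to everything, and the two-pass idea Avi describes. The helper name to_numeric_two_pass and the sample values in dat2 are made up for illustration. U+200B is only a stand-in, since the thread does not say which invisible character was in the original data.

```r
# Inputs that as.numeric() parses on its own: the examples from the quoted
# message, plus values padded with blanks, a tab and a newline.
as.numeric(c("1.23e8", "0X12", "0xAE", "  42\t", "\n3.5 "))
# values: 1.23e8, 18, 174, 42, 3.5

# Ivan's pattern applied to everything damages valid input:
# "1.23e8" becomes "1.238" and "0xAE" becomes "0".
as.numeric(gsub("[^0-9.-]+", "", c("1.23e8", "0xAE")))
# values: 1.238, 0

# Hypothetical sample data. The first element carries a trailing
# zero-width space (U+200B), chosen here purely as an assumption.
dat2 <- c("12.5\u200b", "1.23e8", "0xAE", "1.2 e 5", "abc")

# Hypothetical helper sketching the two-pass idea.
to_numeric_two_pass <- function(x) {
  # Pass 1: plain conversion. Whatever as.numeric() understands is kept.
  out <- suppressWarnings(as.numeric(x))
  # Pass 2: only for the failures, strip everything except ASCII digits,
  # dots and hyphen-minus (Ivan's pattern), then convert again.
  bad <- is.na(out) & !is.na(x)
  out[bad] <- suppressWarnings(as.numeric(gsub("[^0-9.-]+", "", x[bad])))
  out
}

to_numeric_two_pass(dat2)
# values: 12.5, 1.23e8, 174, 1.25, NA
```

Note the fourth element: with this particular pattern "1.2 e 5" loses its "e" as well as its spaces and comes back as 1.25, which is a concrete case of the second pass fixing too much. Reporting which elements needed the second pass, rather than converting them silently, might be a safer design.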

cheers,

Rolf

On Mon, 3 Mar 2025 12:19:02 -0500 <avi.e.gr...@gmail.com> wrote:

> The second solution Ivan offers looks good, and is a bit more general
> than his first, which simply removes one non-visible character.
>
> It raises the question of why the data has that anomaly at all. Did
> the data come from a text-processing environment where it was going
> to wrap there and was protected?
>
> As Ivan points out, there is a question of what format you expect
> numbers in and what as.numeric() should do when it does not see an
> integer or floating point number.
>
> If you test it, you can see that as.numeric() ignores leading and
> trailing blanks, tabs, newlines and other ASCII whitespace. In that
> spirit, the Unicode character being mentioned should be one that any
> Unicode-aware version of as.numeric() should ignore.
>
> But Unicode supports a much wider notion of "numeric": there are
> numeric-equivalent symbols in other languages and groupings, and
> even the symbols for numerals in light or dark circles count as
> numbers. Those can likely safely be excluded in this context but
> perhaps not in a more general function.
>
> But I note as.numeric() seems to handle scientific notation, as in:
>
> > as.numeric("1.23e8")
> [1] 1.23e+08
>
> So a single instance of the letter "e" or "E" must be supported if
> your numbers in string form may contain them. Further, the E cannot
> be the first or last character. It cannot have adjacent whitespace.
> Still, if you are OK with getting an NA in such situations, it should
> be OK.
>
> It gets worse. Hexadecimal is supported:
>
> > as.numeric("0X12")
> [1] 18
>
> You now need to support the letters x and X. But only if preceded by
> a zero!
>
> It gets still worse, as any characters from [0-9A-F] are supported:
>
> > as.numeric("0xAE")
> [1] 174
>
> There may be other scenarios it handles.
> The filter applied might remove valid numbers, so you may want to
> document it carefully if your program only handles a restricted set.
>
> A possible idea might be to make two passes: run as.numeric() first,
> and only for the resulting NAs do a substitution like the one Ivan
> suggests to try to fix the broken ones. But note it may fix too much,
> as "1.2 e 5" might become "1.2e5" when the spaces are removed.
>
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Krylov
> via R-help
> Sent: Monday, March 3, 2025 3:09 AM
> To: Christofer Bogaso <bogaso.christo...@gmail.com>
> Cc: r-help <r-help@r-project.org>
> Subject: Re: [R] Failed to convert data to numeric
>
> On Mon, 3 Mar 2025 13:21:31 +0530
> Christofer Bogaso <bogaso.christo...@gmail.com> wrote:
>
> > Is there any way to remove all possible "Unicode character" that may
> > be present in the array at once?
>
> Define a range of characters you consider acceptable, and you'll be
> able to use regular expressions to remove everything else. For
> example, the following expression should remove everything except
> ASCII digits, dots, and hyphen-minus:
>
> gsub('[^0-9.-]+', '', dat2)
>
> There is a brief introduction to regular expressions in ?regex and
> various online resources such as <https://regex101.com/>.
>

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Stats. Dep't. (secretaries) phone: +64-9-373-7599 ext. 89622
Home phone: +64-9-480-4619