Dear Emily,

I have written a more robust version of the function:
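extract.nonLetters = function(x, rm.space = TRUE, normalize = TRUE, sort = TRUE) {
    # normalize to NFC, so that composed and decomposed forms are treated alike
    if(normalize) x = stringi::stri_trans_nfc(x);
    # split into individual characters (not bytes)
    ch = strsplit(x, "", fixed = TRUE);
    ch = unique(unlist(ch));
    if(sort) ch = sort(ch);
    # with rm.space = TRUE, the simple space is handled like a letter and dropped
    pat = if(rm.space) "^[a-zA-Z ]" else "^[a-zA-Z]";
    isLetter = grepl(pat, ch);
    ch = ch[ ! isLetter];
    return(stringi::stri_escape_unicode(ch));
}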
extract.nonLetters(str)
# "\\u2013" "+"

This code point ("\u2013", the en dash) falls within the range covered by the expanded Regex expression:
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
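
As a small check of the difference between the two expressions, here is a sketch applied to the second string only. The en dash is written as the escape "\u2013" so that the example does not depend on the file encoding; the variable names s, narrow and wide are mine, and I assume a UTF-8 session:

```r
# second example string, with the en dash written as an escape
s = "leucocyten \u2013 grampositieve coccen +"

# the original class [-+] does not contain the en dash: nothing to split on
narrow = strsplit(s, "(?<=[-+])\\s++", perl = TRUE)[[1]]

# the expanded class also covers U+2010 to U+2014
wide = strsplit(s, "(?<=[-+\u2010-\u2014])\\s++", perl = TRUE)[[1]]

print(length(narrow))  # 1: the string is returned unsplit
print(wide)            # 2 tokens: split after the en dash
```

The trailing "+" is not followed by white space, so it does not produce a further split.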


Sincerely,

Leonard


On 4/13/2023 9:40 PM, Leonard Mada wrote:
Dear Emily,

Using a look-behind solves the split problem in this case. (Note: a Regex is in many cases the simplest solution.)

str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
"leucocyten – grampositieve coccen +")

tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
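
To illustrate what the look-behind does, here is a minimal sketch on the first string alone (ASCII only). The variable names s and tok are mine. The white space is consumed as the separator only when it is preceded by "+" or "-", so the sign stays attached to the preceding token:

```r
s = "leucocyten + gramnegatieve staven +++ grampositieve staven ++"

# split on white space that follows a "+" or "-";
# the look-behind itself consumes nothing
tok = strsplit(s, "(?<=[-+])\\s++", perl = TRUE)[[1]]
print(tok)
# "leucocyten +"  "gramnegatieve staven +++"  "grampositieve staven ++"
```

The spaces inside "gramnegatieve staven" are not preceded by a sign and are therefore left alone.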

PROBLEM
The current expression does NOT work for a different reason: the "-" is coded using a NON-ASCII character.

I have written a small utility function to approximately extract "non-standard" characters:
### Identify non-ASCII Characters
# beware: filtering and sorting operate on single bytes
# and may break up the byte sequence of a multi-byte character;
extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
    code = as.numeric(unique(unlist(lapply(x, charToRaw))));
    isLetter =
        (code >= 97 & code <= 122) |
        (code >= 65 & code <= 90);
    code = code[ ! isLetter];
    if(rm.space) {
        # removes only simple space!
        code = code[code != 32];
    }
    if(sort) code = sort(code);
    return(code);
}
extract.nonLetters(str, sort = FALSE)
# 43 226 128 147

Note:
- the code for "+" is 43, and for the simple "-" is 45: as.numeric(charToRaw("+-"));
- "226 128 147" codes something else, but it is not trivial to get the Unicode code point;
  https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
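
One possible way to get from these 3 bytes to the code point, using only base R, is sketched below. It assumes that the bytes form one complete UTF-8 sequence; the variable names bytes, ch and cp are mine:

```r
# the 3 bytes reported by the function above
bytes = as.raw(c(226, 128, 147))

# rebuild the character from its bytes and mark it as UTF-8
ch = rawToChar(bytes)
Encoding(ch) = "UTF-8"

# utf8ToInt() returns the Unicode code point as an integer
cp = utf8ToInt(ch)
print(cp)                     # 8211
print(sprintf("U+%04X", cp))  # "U+2013"
```

This works only when the bytes are kept in their original order; sorting or filtering the byte codes first will usually give an invalid sequence.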

The following is a more comprehensive Regex expression, which accepts many variants of "-":
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)

Sincerely,

Leonard



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
