Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

Henrik Bengtsson Sun, 07 Sep 2014 19:43:11 -0700

Thank you David - it took me awhile to get back to this and dig into
it.  It's clever to imitate gtools::mixedorder() as far as possible.
A few comments:


1. It took me a while to understand why you picked 3899 in your
Roman-to-integer table; it's because roman(x) is NA for x > 3899.
(BTW, in 'utils', there's utils:::.roman2numeric() which could be
utilized, but it's currently internal.)

2. I think you forgot D=500 and M=1000.

3. There was a typo in your code; I think you meant rank.roman instead
of rank.numeric in one place.

4. The idea behind nonnumeric() is to identify non-numeric substrings
by is.na(as.numeric()).  Unfortunately, for romans that does not work.
Instead, we need to use is.na(numeric(x)) here, i.e.

  nonnumeric <- function(x) {
      suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
  }

Actually, gtools::mixedorder() could use the same.

5. I undid your ".numeric" to ".roman" to minimize any differences to
gtools::mixedorder().


With the above fixes, we now have:

mixedorderRoman <- function (x)
{
    if (length(x) < 1)
        return(NULL)
    else if (length(x) == 1)
        return(1)
    if (is.numeric(x))
        return(order(x))
    delim = "\\$\\@\\$"
    # NOTE: Note that as.roman(x) is NA for x > 3899
    romanC <- as.character( as.roman(1:3899) )
    numeric <- function(x) {
        suppressWarnings(match(x, romanC))
    }
    nonnumeric <- function(x) {
        suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
            NA))
    }
    x <- as.character(x)
    which.nas <- which(is.na(x))
    which.blanks <- which(x == "")
    if (length(which.blanks) > 0)
        x[which.blanks] <- -Inf
    if (length(which.nas) > 0)
        x[which.nas] <- Inf
    delimited <- gsub("([IVXCLM]+)",
        paste(delim, "\\1", delim, sep = ""), x)
    step1 <- strsplit(delimited, delim)
    step1 <- lapply(step1, function(x) x[x > ""])
    step1.numeric <- lapply(step1, numeric)
    step1.character <- lapply(step1, nonnumeric)
    maxelem <- max(sapply(step1, length))
    step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
        function(x) x[i]))
    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
        function(x) x[i]))
    rank.numeric <- sapply(step1.numeric.t, rank)
    rank.character <- sapply(step1.character.t, function(x)
as.numeric(factor(x)))
    rank.numeric[!is.na(rank.character)] <- 0
    rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
        2, max, na.rm = TRUE))
    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
        rank.character)
    order.frame <- as.data.frame(rank.overall)
    if (length(which.nas) > 0)
        order.frame[which.nas, ] <- Inf
    retval <- do.call("order", order.frame)
    return(retval)
}


The difference to gtools::mixedorder() is minimal:

<     romanC <- as.character( as.roman(1:3899) )
21c11
<         suppressWarnings(match(x, romanC))
---
>         suppressWarnings(as.numeric(x))
24c14
<         suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
---
>         suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
34c24
<     delimited <- gsub("([IVXCLDM]+)",
---
>     delimited <- 
> gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
59,62d48

This difference is so small that the above could now be an option to
mixedorder() with minimal overhead added, e.g. mixedorder(y,
type=c("decimal", "roman")).  One could even imagine adding support
for "binary", "octal" and "hexadecimal" (not done).

Greg (maintainer of gtools; cc:ed), is this something you would
consider adding to gtools?  I've modified the gtools source code
available on CRAN (that's the only source I found), added package
tests, updated the Rd and verified it passes R CMD check.  If
interested, please find the updates at:

  https://github.com/HenrikBengtsson/gtools/compare/cran:master...master

Thanks

Henrik

On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>
>> Hi,
>>
>> does anyone know of an implementation/function that sorts strings that
>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>> numbers.  In 'gtools' there is mixedsort() which does this for strings
>> that contains (decimal) numbers.  I'm looking for a "mixedsortroman()"
>> function that does the same but with roman numbers, e.g.
>
> It's pretty easy to sort something you know to be congruent with the existing 
> roman class:
>
> romanC <- as.character( as.roman(1:3899) )
> match(c("I", "II", "III","X","V"), romanC)
> #[1]  1  2  3 10  5
>
> But I guess you already know that, so you want a regex approach to parsing. 
> Looking at the path taken by Warnes, it would involve doing something like 
> his regex based insertion of a delimiter for "Roman numeral" but simpler 
> because he needed to deal with decimal points and signs and exponent 
> notation, none of which you appear to need. If you only need to consider 
> character and Roman, then this hack of Warnes tools succeeds:
>
>  mixedorderRoman <- function (x)
> {
>     if (length(x) < 1)
>         return(NULL)
>     else if (length(x) == 1)
>         return(1)
>     if (is.numeric(x))
>         return(order(x))
>     delim = "\\$\\@\\$"
>     roman <- function(x) {
>         suppressWarnings(match(x, romanC))
>     }
>     nonnumeric <- function(x) {
>         suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
>             NA))
>     }
>     x <- as.character(x)
>     which.nas <- which(is.na(x))
>     which.blanks <- which(x == "")
>     if (length(which.blanks) > 0)
>         x[which.blanks] <- -Inf
>     if (length(which.nas) > 0)
>         x[which.nas] <- Inf
>     delimited <- gsub("([IVXCL]+)",
>         paste(delim, "\\1", delim, sep = ""), x)
>     step1 <- strsplit(delimited, delim)
>     step1 <- lapply(step1, function(x) x[x > ""])
>     step1.roman <- lapply(step1, roman)
>     step1.character <- lapply(step1, nonnumeric)
>     maxelem <- max(sapply(step1, length))
>     step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
>         function(x) x[i]))
>     step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>         function(x) x[i]))
>     rank.roman <- sapply(step1.roman.t, rank)
>     rank.character <- sapply(step1.character.t, function(x) 
> as.numeric(factor(x)))
>     rank.roman[!is.na(rank.character)] <- 0
>     rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
>         2, max, na.rm = TRUE))
>     rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>         rank.character)
>     order.frame <- as.data.frame(rank.overall)
>     if (length(which.nas) > 0)
>         order.frame[which.nas, ] <- Inf
>     retval <- do.call("order", order.frame)
>     return(retval)
> }
>
> y[mixedorderRoman(y)]
>  [1] "chr I"    "chr II"   "chr III"  "chr IV"   "chr IX"
>  [6] "chr V"    "chr VI"   "chr VII"  "chr VIII" "chr X"
> [11] "chr XI"   "chr XII"
>
>
> --
> David.
>>
>> ## DECIMAL NUMBERS
>>> x <- sprintf("chr %d", 12:1)
>>> x
>> [1] "chr 12" "chr 11" "chr 10" "chr 9"  "chr 8"
>> [6] "chr 7"  "chr 6"  "chr 5"  "chr 4"  "chr 3"
>> [11] "chr 2"  "chr 1"
>>
>>> sort(x)
>> [1] "chr 1"  "chr 10" "chr 11" "chr 12" "chr 2"
>> [6] "chr 3"  "chr 4"  "chr 5"  "chr 6"  "chr 7"
>> [11] "chr 8"  "chr 9"
>>
>>> gtools::mixedsort(x)
>> [1] "chr 1"  "chr 2"  "chr 3"  "chr 4"  "chr 5"
>> [6] "chr 6"  "chr 7"  "chr 8"  "chr 9"  "chr 10"
>> [11] "chr 11" "chr 12"
>>
>>
>> ## ROMAN NUMBERS
>>> y <- sprintf("chr %s", as.roman(12:1))
>>> y
>> [1] "chr XII"  "chr XI"   "chr X"    "chr IX"
>> [5] "chr VIII" "chr VII"  "chr VI"   "chr V"
>> [9] "chr IV"   "chr III"  "chr II"   "chr I"
>>
>>> sort(y)
>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>> [5] "chr IX"   "chr V"    "chr VI"   "chr VII"
>> [9] "chr VIII" "chr X"    "chr XI"   "chr XII"
>>
>>> mixedsortroman(y)
>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>> [5] "chr V"    "chr VI"   "chr VII"  "chr VIII"
>> [9] "chr IX"   "chr X"    "chr XI"   "chr XII"
>>
>> The latter is what I'm looking for.
>>
>> Before hacking together something myself (e.g. identify roman numerals
>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>> to sort them and then translate them back to roman numbers), I'd like
>> to hear if someone already has this implemented/know of a package that
>> does this.
>>
>> Thanks,
>>
>> Henrik
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

Reply via email to