On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>
> > Based upon an offlist communication this morning, I am somewhat confused
> > (more than I usually am on most Monday mornings...) about the use of
> > grep() with factors as the 'x' argument.
> > ...
> > > grep("[a-z]", letters)
> >  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > [23] 23 24 25 26
> >
> > > grep("[a-z]", factor(letters))
> > numeric(0)
>
> I was recently surprised by this also.  In addition, if
> R's grep did support factors in this way, what sort of
> object (factor or character) should it return when value=T?
> I recently changed Splus's grep to return a character vector in
> that case.
>
> Splus> grep("[def]", letters[26:1])
> [1] 21 22 23
> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]))
> [1] 21 22 23
> Splus> grep("[def]", letters[26:1], value=T)
> [1] "f" "e" "d"
> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
> [1] "f" "e" "d"
> Splus> class(.Last.value)
> [1] "character"
>
> R does this when grepping an integer vector.
> R> grep("1", 0:11, value=T)
> [1] "1"  "10" "11"
> help(grep) says it returns "the matching elements themselves", but
> doesn't say if "themselves" means before or after the conversion to
> character.

Bill,

My first inclination for the return value when used on a factor would be
the indexed factor elements, where grep() would otherwise simply return
the indices. This would also maintain the factor levels from the original
source factor, since "[.factor" would normally retain these when
drop = FALSE.

For example:

# Return the indexed values as would otherwise be done
# in grep() if the factor to character coercion takes place:
# Use the same indices 21:23 as above
> factor(letters[26:1], levels = letters[26:1])[21:23]
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

From my read of the C code in do_grep() in character.c (again, if
correct), when 'value = TRUE', the C code appears to first get the
indices and then build the returned vector from the indexed values of
the source vector in a for() loop. So this should not be a problem
philosophically.

However, given your example of the coercion of integers, perhaps with
grep() at least, consistent behavior would dictate that return values
are always character vectors. These could then be coerced manually back
to a factor, using the original levels, as may be required:

> factor.letters <- factor(letters[26:1], levels=letters[26:1])

> factor.letters
 [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

> grep("[def]", as.character(factor.letters))
[1] 21 22 23

> res <- grep("[def]", as.character(factor.letters), value = TRUE)

> res
[1] "f" "e" "d"

> factor(res, levels = levels(factor.letters))
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

Which of course is the same result I proposed initially above.

I could be convinced either way. The concern of course is that (given
the offlist replies I have received today) even experienced users are
getting bitten by the current behavior versus their intuitive
expectations, which are at least loosely supported by the documentation.
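
In the meantime, the manual steps above could be wrapped up along the
following lines. This is only a minimal sketch: grep_factor() and its
as_factor argument are hypothetical names of my own, not part of base R
and not a proposal for how do_grep() itself should change. It coerces
explicitly with as.character(), so it does not depend on how grep()
treats a factor, and it lets the caller choose between the two candidate
return types discussed above.

```r
# Hypothetical helper (not in base R): grep() against a factor's labels.
grep_factor <- function(pattern, f, value = FALSE, as_factor = FALSE, ...) {
  stopifnot(is.factor(f))
  # Match against the labels, never the underlying integer codes
  idx <- grep(pattern, as.character(f), ...)
  if (!value) {
    return(idx)
  }
  # f[idx] keeps all of the original levels, since drop defaults to FALSE
  if (as_factor) f[idx] else as.character(f[idx])
}

factor.letters <- factor(letters[26:1], levels = letters[26:1])

grep_factor("[def]", factor.letters)                 # indices
grep_factor("[def]", factor.letters, value = TRUE)   # character vector
grep_factor("[def]", factor.letters, value = TRUE,
            as_factor = TRUE)                        # factor, original levels
```

Defaulting as_factor to FALSE follows the "always return character"
convention suggested by the integer example; flipping the default would
give the behavior of my first inclination instead.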

HTH,

Marc Schwartz

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel