[NOTE: This is a re-send. The previous copy mangled several lines of code in a 
row, so I have switched my temporary email sender to avoid any form of rich 
text. Below is the message I intended to send, with the code looking normal. 
Since many other messages I write benefit from HTML, I may have to flip back 
and forth.]

There are many creative ways to solve problems, and some might get you in 
trouble if you presented them in class; even in some work situations they may 
be hard for most people to understand, let alone maintain and change.

This group is amorphous enough that we have people asking for "help" who are 
new to the language, people who know plenty but have run into a new kind of 
problem, and of course people who want to make use of what they see as free 
labor.

Rui presented a very interesting idea, and I like some aspects of it. But if 
it were shown to most people, they might have to start looking things up.

I admit I liked some of the ideas he used and am adding them to my bag of 
tricks. Some were overkill for this particular requirement, but that is also 
what makes them more general and useful.

First, there was the use of POSIX character classes such as [[:alpha:]], which 
match whatever the current locale considers [:lower:] or [:upper:] and thus 
are not restricted to ASCII characters. Since I do much of my work in 
languages other than English, and may well include names with characters not 
normally found in English, or even from an alphabet that does not overlap at 
all, I can easily encounter items in the Name column that do not match 
[A-Za-z] but do match [[:alpha:]].
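
For instance (a small sketch, assuming a UTF-8 locale; the names are made up), 
a name written entirely in another alphabet fails [A-Za-z] but still matches 
[[:alpha:]]:

   grepl("[A-Za-z]",    c("Anna", "Иван"))    # TRUE FALSE
   grepl("[[:alpha:]]", c("Anna", "Иван"))    # TRUE TRUE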

I don't know whether [[:digit:]] has benefits over [0-9], and I note there was 
no requirement to match anything more complex than integers, so there is no 
need to allow for decimal points, scientific notation, and so on.

Then there is the use of mapply(). A more general version of the problem would 
involve a data.frame with any number of columns, where some subset of the 
columns needs to be checked against conditions that vary from column to column 
but may fall into a few broad, reusable categories. If every condition can be 
expressed as a regular expression, you can extend the list Rui used with more 
items, and for any column that should be ignored you can supply a pattern that 
can never match (with this approach a row is dropped as soon as any pattern 
matches, so a pattern that always matched would drop every row):


   # the fourth pattern, "\\b\\B", can never match, so that column is in effect ignored
   regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", "\\b\\B")


So this generalizes to N columns as long as you supply exactly N patterns in 
the list, although mapply() does recycle arguments if needed, as in the 
simplest case where every column is checked the same way.
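
For instance (a hedged sketch; the punctuation check and the name same_check 
are only for illustration), a one-element list is recycled across every column 
of dat1:

   same_check <- mapply(\(x, r) grepl(r, x), dat1, list("[[:punct:]]"))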

Rui then passes an anonymous function to mapply() using the backslash 
shorthand, which was added in R 4.1.0 along with the native pipe |>. It can be 
used anywhere function() can, but it will not parse in older versions of R.


   \(x, r) grepl(r, x)
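
The longhand equivalent, which also works in R versions older than 4.1.0:

   function(x, r) grepl(r, x)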


I note Rui also uses grepl(), which returns a logical vector. I will show my 
own first attempt at the end, where I used grep() to return the index numbers 
of matches instead. Here, though, he makes use of the fact that mapply() in 
this case returns a logical matrix:

i <- mapply(\(x, r) grepl(r, x), dat1, regex)

> i
      Name   Age Weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE  FALSE
[5,] FALSE FALSE  FALSE
[6,]  TRUE FALSE  FALSE

And since R treats TRUE as 1 and FALSE as 0, summing across each row gives a 
small integer between 0 and the number of columns, inclusive; only the rows 
with no TRUE anywhere in them are wanted for this purpose:


dat1[rowSums(i) == 0L, ]
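
For the matrix shown above, rowSums(i) is 1 0 0 1 0 1, so rows 2, 3 and 5 
survive. An equivalent way to spell the same filter (my rewording, not Rui's 
code) is:

dat1[!apply(i, 1, any), ]    # keep only rows where no column matched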

All in all, nicely done, but not trivial to read without comments, LOL!

And, yes, it could be made even more obscure as a one-liner.

My first attempt was a bit more focused on the specific needs described. I am 
not sure how the HTML destroyer in this mailing list might wreck it, but I made 
it a two-statement version that is formatted on multiple lines. An explanation 
first.

I looked at using grep() on one column at a time to look for what should NOT 
be there, and asked it to invert the answer so that it effectively tells me 
which rows to keep. Conceptually, it tests column 1 ($Name) for digits: a hit 
means "toss this row", and no hit means the row is, so far, still valid. Note, 
though, that since I am not using grepl(), it does not return TRUE/FALSE at 
all; with invert = TRUE it returns the index numbers of the entries that did 
NOT match. What goes in is the vector of individual items from one column of 
the data; what comes out is the set of row indices I want to keep, which can 
be used to index the whole data.frame. With the sample data it returns 1:5, 
because row 6 has a digit in "Jack3".


  grep("[0-9]", dat1$Name, invert = TRUE)


Similarly, two more grep() calls test whether the second and third columns 
contain any characters matching [a-zA-Z] and return similar index vectors of 
the rows that are OK.

What I then have are three integer index vectors, not a matrix. Each contains 
a subset of all the row indices:


> grep("[0-9]", dat1$Name, invert = TRUE)
[1] 1 2 3 4 5
> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
[1] 1 2 3 5 6
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6

This sample data was designed so that exactly one row gets tossed per column, 
which makes the three vectors the same length, but they need not be. Like Rui, 
my condition for keeping a row is that all three index vectors contain that 
row's number. He summed logicals; my vectors hold small integers, so I combine 
them with a set intersection instead, which excludes any row not present in 
all three. The intersect() built into R only handles two vectors at a time, so 
I nested two calls; in a more general case I would use some package (or write 
my own function) that intersects any number of such vectors, though base R's 
Reduce() can also fold intersect() over a list, as sketched below.
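
A quick sketch of that Reduce() fold (the object name keep is just for 
illustration):

keep <- Reduce(intersect,
               list(grep("[0-9]",    dat1$Name,   invert = TRUE),
                    grep("[a-zA-Z]", dat1$Age,    invert = TRUE),
                    grep("[a-zA-Z]", dat1$Weight, invert = TRUE)))
dat1[keep, ]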

Here is the full code, minus the initialization.


rows.keep <-
  intersect(intersect(grep("[0-9]",    dat1$Name,   invert = TRUE),
                      grep("[a-zA-Z]", dat1$Age,    invert = TRUE)),
            grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep, ]
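
Run against the sample data, this keeps the same three rows as Rui's version:

result
#   Name Age Weight
#2   Bob  25    142
#3 Carol  24    120
#5  Katy  35    160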




-----Original Message-----
From: Rui Barradas <ruipbarra...@sapo.pt>
To: David Carlson <dcarl...@tamu.edu>; Bert Gunter <bgunter.4...@gmail.com>
Cc: r-help@R-project.org (r-help@r-project.org) <r-help@r-project.org>
Sent: Sat, Jan 29, 2022 3:46 am
Subject: Re: [R] Row exclude

Hello,

Getting creative, here is another way with mapply.


regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
dat1[rowSums(i) == 0L, ]

#   Name Age Weight
#2   Bob  25    142
#3 Carol  24    120
#5  Katy  35    160


Hope this helps,

Rui Barradas
