Re: [R] Row exclude

Bert Gunter Sat, 29 Jan 2022 12:29:00 -0800

Rui:

You made my day! -- or at least considerably improved it. Your
solution was clever and clear. IMHO, it is also a terrific example of
why one should expend the effort to really learn the core features of
the language before plunging into packages with alternative paradigms.
(But lots of wise folks will disagree, so let's not debate that and
just consider me a luddite if you like).


A minor tweak would be to add punctuation characters to the regexes:

> dig <- '[[:digit:][:punct:]]' ; nondig <- '[[:alpha:][:punct:]]'
> mapply(\(r,x)grepl(r,x),list(dig, nondig, nondig), dat1)

This of course would need to be modified for numeric columns that use '.'
or ',' as a decimal separator. Most examples I've seen involved
contamination of numeric entries by a particular character or two
(like ','), which could of course be handled easily.
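
One way to allow for a decimal separator is to whitelist the accepted
shape of a number instead of blacklisting characters. The following is
only a sketch under stated assumptions: the vector x and the pattern
okpat are made up for illustration and are not part of the data in this
thread.

```r
## Hypothetical numeric-as-character column; the values are invented.
x <- c("20", " 3.5", "1,25", "13X", "1.2.3", "4,")

## Accept digits, optionally followed by ONE '.' or ',' and more digits.
## Leading and trailing white space is tolerated.
okpat <- "^[[:space:]]*[[:digit:]]+([.,][[:digit:]]+)?[[:space:]]*$"

## TRUE marks an entry that does not look like a number.
bad <- !grepl(okpat, x)
x[bad]
```

Anchoring the pattern with '^' and '$' is what makes this a whole-entry
check, so a second separator or a trailing ',' is flagged as well.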

As usual, one of the virtues of a nice solution like yours is that it
can easily be generalized, say to the case of a data frame with 100's
of columns. One just has to be a bit careful about details.

A usual 'gotcha' will be to ensure that factor columns are read in or
converted to character.  Another is that you need to first remove any
non-character -- typically non-polluted numeric -- columns from the
data frame. This can be done by something like:
  dat <- dat[, sapply(dat, is.character), drop = FALSE]
Anyway, with those caveats and perhaps others that I either haven't
thought of or may be data-specific, here is an example that
illustrates how nicely your approach extends.
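
Before that example, the two preparatory caveats just mentioned can be
sketched as follows. This is a minimal illustration, assuming a
hypothetical data frame dat whose columns and values are invented here.

```r
## Hypothetical data frame: one factor column, one clean numeric column,
## and one polluted (hence character) column.
dat <- data.frame(Name   = factor(c("Alex", "Bob", "Jack3")),
                  Height = c(170, 180, 175),
                  Age    = c("20", "25", "3BC"),
                  stringsAsFactors = FALSE)

## Convert factor columns to character, leaving other columns untouched.
isfac <- sapply(dat, is.factor)
dat[isfac] <- lapply(dat[isfac], as.character)

## Keep only the character columns; drop = FALSE keeps a data frame
## even when a single column survives.
chr <- dat[, sapply(dat, is.character), drop = FALSE]
names(chr)
```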

I'll start from the OP's dat1 example.

 dat1 <-read.table(text="Name, Age, Weight
 Alex,  20,  13X
 Bob,   25,  142
 Carol, 24,  120
 John,  3BC,  175
 Katy,  35,  160
 Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=FALSE)

## Now enlarge the table and add a gender column, which should contain
## only upper- or lower-case 'm', 'f', 'o', but which I have corrupted
## with some 'g's (typos).

 set.seed(9901)
 genderAbb <- c('M','F','O','m','f','o','g')
 gender <- sample(genderAbb, 24, replace = TRUE)
 dat1 <- cbind(dat1[rep(1:6,4),],
                    Gender = gender
               )
head(dat1, 8)

      Name   Age Weight Gender
1     Alex    20    13X      O
2      Bob    25    142      M
3    Carol    24    120      o
4     John   3BC    175      o
5     Katy    35    160      f
6    Jack3    34    140      g
1.1   Alex    20    13X      M
2.1    Bob    25    142      f

## Now create a list of the different target 'types' for columns.
## Note that these types are user-created categories, not R data types.
## So one can use whatever names one wants.
## Or could use numeric values -- but that obfuscates the meaning and
## increases the risk of error, imo.
type <- c('char', 'int', 'gend') ## obvious

## Now, using your idea, determine the regexes that identify bad
## entries for each type.
 badpat <- list(
             char = '[[:punct:][:digit:]]', ## added stray punctuation
             int = '[[:punct:][:alpha:]]', ## ditto
             gend = '[^MFOmfo]' )  ## the only gender abbreviations that will be accepted
## Inside a character class, an initial '^' is the regex symbol for
## 'anything *but* these'.


## Now identify what type of data each column should contain. This is
## the part that could be tedious for many columns, but I see no way of
## avoiding it. A smarter UI than I give would help!
target_type <- c('char','int','int','gend')

## and create the corresponding list of regex patterns to use for mapply()
target_pat <- badpat[target_type]
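
With many columns, one possible way to make this step less error-prone
is to key the types by column name rather than by position and to check
the mapping before using it. This is only a sketch: the names cols and
type_of are hypothetical helpers, and the patterns are repeated so that
the block runs on its own.

```r
## Assumed column names, matching the example table in this thread.
cols <- c("Name", "Age", "Weight", "Gender")

## Same patterns as above, repeated so this sketch is self-contained.
badpat <- list(char = '[[:punct:][:digit:]]',
               int  = '[[:punct:][:alpha:]]',
               gend = '[^MFOmfo]')

## Types keyed by column name instead of by position.
type_of <- c(Name = 'char', Age = 'int', Weight = 'int', Gender = 'gend')

## Fail early if a column has no declared type or a type has no pattern.
stopifnot(all(cols %in% names(type_of)),
          all(type_of %in% names(badpat)))

## Look up the pattern for each column, in column order.
target_pat <- badpat[type_of[cols]]
names(target_pat) <- cols
```

The lookup no longer depends on the order in which the types were
typed in, only on the column names.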

## Now do the Barradas trick
result <- mapply(\(pat,x)if(is.character(x))grepl(pat, x)
               else rep(FALSE, NROW(x)),
       target_pat,
       dat1)
head(result, 8) ## it's a matrix, not a data frame of course
## ... and then proceed as you showed.
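
For completeness, the remaining steps might look like the sketch below.
It is self-contained and makes some assumptions of its own: a small
stand-in table named dat with hand-written Gender values (not the
sampled ones above), strip.white = TRUE, and all columns read as
character.

```r
## Small stand-in for the enlarged table; the Gender values are made up.
dat <- read.table(text = "Name, Age, Weight, Gender
Alex,  20,  13X, M
Bob,   25,  142, f
Carol, 24,  120, g
John,  3BC, 175, o
Katy,  35,  160, F
Jack3, 34,  140, m",
  sep = ",", header = TRUE, strip.white = TRUE,
  colClasses = "character")

badpat <- list(char = '[[:punct:][:digit:]]',
               int  = '[[:punct:][:alpha:]]',
               gend = '[^MFOmfo]')
target_pat <- badpat[c('char', 'int', 'int', 'gend')]

## One logical column per data column; TRUE marks a bad entry.
## function(pat, x) is equivalent to \(pat, x) in R >= 4.1.
result <- mapply(function(pat, x) grepl(pat, x), target_pat, dat)

## Keep only the rows with no bad entry in any column.
clean <- dat[rowSums(result) == 0L, ]

## The surviving int-type columns can now be converted safely.
clean[c("Age", "Weight")] <- lapply(clean[c("Age", "Weight")], as.integer)
clean
```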

Cheers,
Bert


On Sat, Jan 29, 2022 at 12:46 AM Rui Barradas <ruipbarra...@sapo.pt> wrote:
>
> Hello,
>
> Getting creative, here is another way with mapply.
>
>
> regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")
>
> i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> dat1[rowSums(i) == 0L, ]
>
> #  Name Age Weight
> #2   Bob   25       142
> #3 Carol   24       120
> #5  Katy   35       160
>
>
> Hope this helps,
>
> Rui Barradas
>
>
> Às 06:30 de 29/01/2022, David Carlson via R-help escreveu:
> > Given that you know which columns should be numeric and which should be
> > character, finding characters in numeric columns or numbers in character
> > columns is not difficult. Your data frame consists of three character
> > columns so you can use regular expressions as Bert mentioned. First you
> > should strip the whitespace out of your data:
> >
> > dat1 <-read.table(text="Name, Age, Weight
> >    Alex,  20,  13X
> >    Bob,  25,  142
> >    Carol, 24,  120
> >    John,  3BC,  175
> >    Katy,  35,  160
> >    Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
> > strip.white=TRUE)
> >
> > Now check to see if all of the fields are character as expected.
> >
> > sapply(dat1, typeof)
> > #        Name         Age      Weight
> > # "character" "character" "character"
> >
> > Now identify character variables containing numbers and numeric variables
> > containing characters:
> >
> > BadName <- which(grepl("[[:digit:]]", dat1$Name))
> > BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
> > BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
> >
> > Next remove those rows:
> >
> > (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
> > #    Name Age Weight
> > #  2   Bob  25    142
> > #  3 Carol  24    120
> > #  5  Katy  35    160
> >
> > You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
> > as.numeric(dat2$Age).
> >
> > David Carlson
> >
> >
> > On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
> >
> >> As character 'polluted' entries will cause a column to be read in (via
> >> read.table and relatives) as factor or character data, this sounds like a
> >> job for regular expressions. If you are not familiar with this subject,
> >> time to learn. And, yes, some heavy lifting will be required.
> >> See ?regexp for a start maybe? Or the stringr package?
> >>
> >> Cheers,
> >> Bert
> >>
> >>
> >>
> >>
> >> On Fri, Jan 28, 2022, 7:08 PM Val <valkr...@gmail.com> wrote:
> >>
> >>> Hi All,
> >>>
> >>> I want to remove rows that contain a character string in an integer
> >>> column or a digit in a character column.
> >>>
> >>> Sample data
> >>>
> >>> dat1 <-read.table(text="Name, Age, Weight
> >>>   Alex,  20,  13X
> >>>   Bob,   25,  142
> >>>   Carol, 24,  120
> >>>   John,  3BC,  175
> >>>   Katy,  35,  160
> >>>   Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
> >>>
> >>> If the Age or Weight column contains any letter(s), remove that row;
> >>> if the Name column contains a digit, remove that row.
> >>> Desired output
> >>>
> >>>     Name   Age weight
> >>> 1   Bob     25    142
> >>> 2   Carol   24    120
> >>> 3   Katy    35    160
> >>>
> >>> Thank you,
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
> >>> PLEASE do read the posting guide
> >>> https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>

