Re: [R] remove

P Tennant Sun, 12 Feb 2017 23:30:08 -0800

Val,

Working with R's special missing value indicator (NA) would be useful here. You could use the na.strings arg in read.table() to recognise "-" as a missing value:


dfr <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE, na.strings = c("NA", "-"))

and then modify the function used by ave() or by() to exclude missing values from the count of unique last names. Here's one approach adapting code from earlier in this thread:

err <- ave(dfr$last, dfr$first, FUN = function(x)length(unique(x[!is.na(x)])))

res <- dfr[err == 1 , ]
res <- res[order(res$first) , ]
res

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 <NA>

Alternatively, if not using na.strings, change "-" to NA after first reading the data in: identify last names recorded as "-" using an index, and assign NA to these elements, before proceeding as above.
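
A minimal sketch of that alternative, assuming the same nine-row example and the same ave() call as above. The name is_dash is only illustrative:

```r
# Read the data WITHOUT na.strings, so "-" arrives as an ordinary string
dfr <- read.table(text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE)

# Logical index of the last names recorded as "-"
# (%in% never returns NA, so it is safe to use for assignment)
is_dash <- dfr$last %in% "-"

# Assign NA to those elements
dfr$last[is_dash] <- NA

# Proceed as above: count unique non-missing last names per first name
err <- ave(dfr$last, dfr$first,
           FUN = function(x) length(unique(x[!is.na(x)])))
res <- dfr[err == 1, ]
res <- res[order(res$first), ]
res
```

The rows kept should be the same Bob and Cory rows as in the na.strings version.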


Philip

On 13/02/2017 3:18 PM, Val wrote:

Hi Jeff and All,

When I examined the excluded data, i.e., first names with
different last names, I noticed that some last names were not
recorded.
For instance, I modified the data as follows:
DF<- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2     -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )


err2<- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
              }
            )
result2<- DF[ 1 == err2, ]
result2

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John

However, I want to keep Cory's records. A last name that was not recorded
should be assumed to be the same as the recorded one.

Final output should be

  first week last
    Bob    1 John
    Bob    2 John
    Bob    3 John
   Cory    1 Jack
   Cory    2    -

Thank you again!

On Sun, Feb 12, 2017 at 7:28 PM, Val<valkr...@gmail.com>  wrote:

Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was about what happened when I used
length(unique(result2$first))
      vs
dim(result2[!duplicated(result2[,c('first')]),]) [1]

I did get different results, but now I have found the problem.

Thank you!
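
For what it is worth, a small sketch comparing the two expressions on made-up data (the five rows below are illustrative, not the real data set). Both count distinct first names, so on a data frame with two or more columns they should agree:

```r
# Illustrative data only: two distinct first names
result2 <- data.frame(
  first = c("Bob", "Bob", "Bob", "Cory", "Cory"),
  week  = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Count the distinct values directly
n_unique <- length(unique(result2$first))

# Keep the first row for each first name, then count the rows
n_dedup <- dim(result2[!duplicated(result2[, c("first")]), ])[1]

n_unique == n_dedup
```

If result2 had only one column, the row subset would drop to a plain vector and dim() would return NULL, so the first form is the more robust of the two.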








On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
<jdnew...@dcn.davis.ca.us>  wrote:

Your question mystifies me, since it looks to me like you already know the 
answer.
--
Sent from my phone. Please excuse my brevity.

On February 12, 2017 3:30:49 PM PST, Val<valkr...@gmail.com>  wrote:

Hi Jeff and all,
How do I get the  number of unique first names   in the two data sets?

for the first one,
result2<- DF[ 1 == err2, ]
length(unique(result2$first))




On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
<jdnew...@dcn.davis.ca.us>  wrote:

The "by" function aggregates and returns a result with generally fewer rows
than the original data. Since you are looking to index the rows in the
original data set, the "ave" function is better suited because it always
returns a vector that is just as long as the input vector:

# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF<- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

err<- ave( DF$last
           , DF[ , "first", drop = FALSE]
           , FUN = function( lst ) {
               length( unique( lst ) )
             }
           )
result<- DF[ "1" == err, ]
result

Notice that the ave function returns a vector of the same type as was given
to it, so even though the function returns a numeric the err
vector is character.
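
A short sketch of that type behaviour, using made-up vectors (the names first, last, err_chr and err_int are illustrative only):

```r
first <- c("Alex", "Bob", "Cory", "Cory")
last  <- c("West", "John", "Jack", "Jack")

# Character input: the counts are coerced to character on the way out
err_chr <- ave(last, first, FUN = function(x) length(unique(x)))
class(err_chr)

# Integer input (row positions): the counts stay integer
err_int <- ave(seq_along(first), first,
               FUN = function(n) length(unique(last[n])))
class(err_int)

# Converting the character result recovers the numbers
as.numeric(err_chr)
```

This is why the first version compares against "1" while the index-based version compares against 1.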

If you wanted to be able to examine more than one other column in
determining the keep/reject decision, you could do:

err2<- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
              }
            )
result2<- DF[ 1 == err2, ]
result2

and then you would have the option to re-use the "n" index to look at other
columns as well.
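
As a sketch of that idea, assuming a purely hypothetical second rule (no week recorded twice for the same person) on top of the single-last-name rule. The second rule and the name keep are illustrative, not something asked for in this thread:

```r
DF <- read.table(text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE)

keep <- ave(seq_along(DF$first), DF$first, FUN = function(n) {
  # rule from the thread: exactly one distinct last name per person
  one_last <- length(unique(DF[n, "last"])) == 1
  # hypothetical extra rule: no repeated week within a person
  no_dup_week <- !anyDuplicated(DF[n, "week"])
  as.integer(one_last && no_dup_week)
})

result_multi <- DF[keep == 1, ]
result_multi
```

Because the function receives row positions rather than values, any column of DF can be consulted inside it.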

Finally, here is a dplyr solution:

library(dplyr)
result3<- (   DF
            %>% group_by( first ) # like a prep for ave or by
            %>% mutate( err = length( unique( last ) ) ) # similar to ave
            %>% filter( 1 == err ) # drop the rows with too many last names
            %>% select( -err ) # drop the temporary column
            %>% as.data.frame # convert back to a plain-jane data frame
            )
result3

which uses a small set of verbs in a pipeline of functions to go from input
to result in one pass.

If your data set is really big (running out of memory big) then you might
want to investigate the data.table or sqlite packages, either of which can
be combined with dplyr to get a standardized syntax for managing larger
amounts of data. However, most people actually aren't running out of memory
so in most cases the extra horsepower isn't actually needed.


On Sun, 12 Feb 2017, P Tennant wrote:

Hi Val,

The by() function could be used here. With the dataframe dfr:

# split the data by first name and check for more than one last name
# for each first name
res<- by(dfr, dfr['first'], function(x) length(unique(x$last))>  1)
# make the result more easily manipulated
res<- as.table(res)
res
# first
# Alex   Bob  Cory
# TRUE FALSE FALSE

# then use this result to subset the data
nw.dfr<- dfr[!dfr$first %in% names(res[res]) , ]
# sort if needed
nw.dfr[order(nw.dfr$first) , ]

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 Jack


Philip

On 12/02/2017 4:02 PM, Val wrote:

Hi all,
I have a big data set and want to remove rows conditionally.
In my data file each person was recorded for several weeks. Somehow
during the recording periods, their last name was misreported. For
each person, the last name should be the same. Otherwise remove them from
the data. For example, in the following data set, Alex was found to have
two last names.

Alex   West
Alex   Joseph

Alex should be removed from the data. If this happens then I want to
remove all rows with Alex. Here is my data set:

df<- read.table(header=TRUE, text='first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West ')

Desired output

        first  week last
1     Bob     1   John
2     Bob     2   John
3     Bob     3   John
4     Cory     1   Jack
5     Cory     2   Jack

Thank you in advance

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

