Hi Henrik,

Thanks for pointing out the diffobj package and the clear example. Nice!
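
For the archive, here is a rough sketch of the "a bit fancier" comparison Bill mentions further down in the thread: numeric columns are compared within a tolerance, all other columns must match exactly, and an NA only matches another NA. The function name mismatched_rows() and its tol argument are my own invention for illustration, not part of any package, and the sketch assumes both data.frames have the same dimensions and the same column names in the same order.

```r
# Hypothetical helper (not from any package): returns the row numbers where
# x1 and x2 differ. Assumes identical dimensions and column order.
mismatched_rows <- function(x1, x2, tol = 1e-8) {
  stopifnot(is.data.frame(x1), is.data.frame(x2),
            ncol(x1) > 0,
            identical(dim(x1), dim(x2)),
            identical(names(x1), names(x2)))
  # Build one logical vector per column: TRUE where that column differs.
  differs <- mapply(function(a, b) {
    both_na <- is.na(a) & is.na(b)
    one_na  <- xor(is.na(a), is.na(b))
    if (is.numeric(a) && is.numeric(b)) {
      # Numeric columns: allow differences up to tol.
      d <- (a != b) & (abs(a - b) > tol)
    } else {
      # Character, factor and other columns: require an exact match.
      d <- as.character(a) != as.character(b)
    }
    d[both_na] <- FALSE   # NA vs NA counts as a match
    d[one_na]  <- TRUE    # NA vs a value counts as a mismatch
    d
  }, x1, x2, SIMPLIFY = FALSE)
  # A row is a mismatch if any of its columns differs.
  which(Reduce(`|`, differs))
}

# Bill's example data from the quoted message below.
x1 <- data.frame(X = 1:5, Y = rep(c("A", "B"), c(3, 2)))
x2 <- data.frame(X = c(1, 2, -3, -4, 5), Y = rep(c("A", "B"), c(2, 3)))

bad <- mismatched_rows(x1, x2)
bad           # row numbers that differ (rows 3 and 4 for this data)
length(bad)   # count of records that do not match
```

Unlike rowSums(x1 != x2) > 0, this does not return NA for rows containing missing values, which may matter with 95 columns of real data. Whether NA should match NA is a design choice; adjust it if your files encode missing values differently.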

On Sun, Jan 28, 2018 at 6:22 PM, Marsh Hardy ARA/RISK <mha...@ara.com> wrote:
> Thanks, I think I've found the most succinct expression of differences in
> two data.frames...
>
> length(which( rowSums( x1 != x2 ) > 0))
>
> gives a count of the # of records in two data.frames that do not match.
>
> //
> ________________________________________
> From: Henrik Bengtsson [henrik.bengts...@gmail.com]
> Sent: Sunday, January 28, 2018 11:12 AM
> To: Ulrik Stervbo
> Cc: Marsh Hardy ARA/RISK; r-help@r-project.org
> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>
> The diffobj package (https://cran.r-project.org/package=diffobj) is
> really helpful here. It provides "diff" functions diffPrint(),
> diffStr(), and diffChr() to compare two objects 'x' and 'y' and provide
> neat colorized summary output.
>
> Example:
>
> > iris2 <- iris
> > iris2[122:125,4] <- iris2[122:125,4] + 0.1
>
> > diffobj::diffPrint(iris2, iris)
> < iris2
> > iris
> @@ 121,8 / 121,8 @@
> ~     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
>   120          6.0         2.2          5.0         1.5 virginica
>   121          6.9         3.2          5.7         2.3 virginica
> < 122          5.6         2.8          4.9         2.1 virginica
> > 122          5.6         2.8          4.9         2.0 virginica
> < 123          7.7         2.8          6.7         2.1 virginica
> > 123          7.7         2.8          6.7         2.0 virginica
> < 124          6.3         2.7          4.9         1.9 virginica
> > 124          6.3         2.7          4.9         1.8 virginica
> < 125          6.7         3.3          5.7         2.2 virginica
> > 125          6.7         3.3          5.7         2.1 virginica
>   126          7.2         3.2          6.0         1.8 virginica
>   127          6.2         2.8          4.8         1.8 virginica
>
> What's not shown here is that the colored output (supported by many
> terminals these days) also highlights exactly which elements in those
> rows differ.
>
> /Henrik
>
> On Sun, Jan 28, 2018 at 12:17 AM, Ulrik Stervbo <ulrik.ster...@gmail.com>
> wrote:
> > The anti_join from the package dplyr might also be handy.
> >
> > install.packages("dplyr")
> > library(dplyr)
> > anti_join(x1, x2)
> >
> > You can get help on the different functions by ?function.name(), so
> > ?anti_join() will bring you help - and examples - on the anti_join
> > function.
> >
> > It might be worth testing your approach on a small subset of the data.
> > That makes it easier for you to follow what happens and evaluate the
> > outcome.
> >
> > HTH
> > Ulrik
> >
> > Marsh Hardy ARA/RISK <mha...@ara.com> wrote on Sun, Jan 28, 2018, 04:14:
> >
> >> Cool, looks like that'd do it, almost as if converting an entire record
> >> to a character string and comparing strings.
> >>
> >> ________________________________________
> >> From: William Dunlap [wdun...@tibco.com]
> >> Sent: Saturday, January 27, 2018 4:57 PM
> >> To: Marsh Hardy ARA/RISK
> >> Cc: Ulrik Stervbo; Eric Berger; r-help@r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> If your two objects have class "data.frame" (look at class(objectName))
> >> and they both have the same number of columns and the same order of
> >> columns and the column types match closely enough (use
> >> all.equal(x1, x2) for that), then you can try
> >>    which( rowSums( x1 != x2 ) > 0)
> >> E.g.,
> >> > x1 <- data.frame(X=1:5, Y=rep(c("A","B"),c(3,2)))
> >> > x2 <- data.frame(X=c(1,2,-3,-4,5), Y=rep(c("A","B"),c(2,3)))
> >> > x1
> >>   X Y
> >> 1 1 A
> >> 2 2 A
> >> 3 3 A
> >> 4 4 B
> >> 5 5 B
> >> > x2
> >>    X Y
> >> 1  1 A
> >> 2  2 A
> >> 3 -3 B
> >> 4 -4 B
> >> 5  5 B
> >> > which( rowSums( x1 != x2 ) > 0)
> >> [1] 3 4
> >>
> >> If you want to allow small numeric differences but exact character
> >> matches you will have to get a bit fancier. Splitting the data.frames
> >> into character and numeric parts and comparing each works well.
> >>
> >> Bill Dunlap
> >> TIBCO Software
> >> wdunlap tibco.com
> >>
> >> On Sat, Jan 27, 2018 at 1:18 PM, Marsh Hardy ARA/RISK <mha...@ara.com>
> >> wrote:
> >> Hi Guys, I apologize for my rank & utter newness at R.
> >>
> >> I used summary() and found about 95 variables, both character and
> >> numeric, all with "Length:368842", which I assume is the # of records.
> >>
> >> I'd like to know the record number (row #?) of any record where the data
> >> doesn't match in the 2 files of what should be the same output.
> >>
> >> Thanks in advance, M.
> >>
> >> //
> >> ________________________________________
> >> From: Ulrik Stervbo [ulrik.ster...@gmail.com]
> >> Sent: Saturday, January 27, 2018 10:00 AM
> >> To: Eric Berger
> >> Cc: Marsh Hardy ARA/RISK; r-help@r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> Also, it will be easier to provide helpful information if you'd describe
> >> what in your data you want to compare and what you hope to get out of
> >> the comparison.
> >>
> >> Best wishes,
> >> Ulrik
> >>
> >> Eric Berger <ericjber...@gmail.com> wrote on Sat, Jan 27, 2018, 08:18:
> >> Hi Marsh,
> >> An RDS is not a data structure such as a data.frame. It can be anything.
> >> For example if I want to save my objects a, b, c I could do:
> >> > saveRDS( list(a,b,c), file="tmp.RDS")
> >> Then read them back later with
> >> > myList <- readRDS( "tmp.RDS" )
> >>
> >> Do you have additional information about your "RDSs" ?
> >>
> >> Eric
> >>
> >>
> >> On Sat, Jan 27, 2018 at 6:54 AM, Marsh Hardy ARA/RISK <mha...@ara.com>
> >> wrote:
> >>
> >> > Each RDS is 40 MBs. What's a slick code to compare them row by row,
> >> > IDing row numbers with mismatches?
> >> >
> >> > Thanks in advance.
> >> >
> >> > //
> >> >
> >> > ______________________________________________
> >> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >> [[alternative HTML version deleted]]