Hi Christopher, This seems pretty standard and straightforward, unless I am missing something. You can do the "full join" without changing variable names. Here's a small code example with two tibbles, a and b, where the column 'x' in a corresponds to the column 'u' in b.
a <- tibble(x=1:15,y=21:35) b <- tibble(u=c(1:10,51:55),z=31:45) foo <- merge(a,b,by.x="x",by.y="u",all.x=TRUE,all.y=TRUE) foo # x y z # 1 1 21 31 # 2 2 22 32 # 3 3 23 33 # 4 4 24 34 # 5 5 25 35 # 6 6 26 36 # 7 7 27 37 # 8 8 28 38 # 9 9 29 39 # 10 10 30 40 # 11 11 31 NA # 12 12 32 NA # 13 13 33 NA # 14 14 34 NA # 15 15 35 NA # 16 51 NA 41 # 17 52 NA 42 # 18 53 NA 43 # 19 54 NA 44 # 20 55 NA 45 HTH, Eric On Mon, Jul 6, 2020 at 2:07 AM Richard M. Heiberger <r...@temple.edu> wrote: > > Have you talked directly to the designers of the new database? > One would hope that they had a clear migration path in mind. > Perhaps they just didn't document it to your satisfaction. > > Rich > > On Sun, Jul 5, 2020 at 2:51 PM Christopher W. Ryan <cr...@binghamton.edu> > wrote: > > > > I've been conducting relatively simple COVID-19 surveillance for our > > jurisdiction. We get data on lab test results automatically, and then > > interview patients to obtain other information, like clinical details. > > We had been recording all data in our long-time data system (call it > > dataSystemA). But as of a particular date, there was a major change in > > the data system we were compelled to use. Call the new one dataSystemB. > > dataSystemA and dataSystemB contain very similar information, > > conceptually, but the variable names are all different, and there are > > some variables in one that do not appear in the other. Total number of > > variables in each is about 50-70. > > > > Furthermore, for about 2 weeks prior to the transition, lab test results > > started being deposited into dataSystemB while dataSystemA was still > > being used to record the full information from the interviews. > > Subsequent to the transition, lab test results and interview information > > are being recorded in dataSystemB, while the lab test results alone are > > still being automatically deposited into dataSystemA. > > > > Diagrammatically: > > > > dataSystemA usage: ____________________ ............>> > > > > dataSystemB usage: ......._____________>> > > > > where ________ represents full data and ..... represents partial data, > > and >> represents the progress of time. > > > > > > The following will create MWE of the data wrangling problem, with the > > change in data systems made to occur overnight on 2020-07-07: > > > > library(dplyr) > > dataSystemA <- tibble(lastName = c("POTTER", "WEASLEY", "GRAINGER", > > "LONGBOTTOM"), > > firstName = c("harry", "ron", "hermione", "neville"), > > dob = as.Date(Sys.Date() + c(sample(-3650:-3000, > > size = 2), -3500, -3450)), > > onsetDate = as.Date(Sys.Date() + 1:4), > > symptomatic = c(TRUE, FALSE, NA, NA) ) > > dataSystemB <- tibble(last_name = c("GRAINGER", "LONGBOTTOM", "MALFOY", > > "LOVEGOOD", "DIGGORY"), > > first_name = c("hermione", "neville", "draco", > > "luna", "cedric"), > > birthdate = as.Date(Sys.Date() + c(-3500, -3450, > > sample(-3650:-3000, size = 3))), > > date_of_onset = as.Date(Sys.Date() + 3:7), > > symptoms_present = c(TRUE, TRUE, FALSE, FALSE, TRUE)) > > > > > > > > Obviously, this is all the same public health problem, so I don't want a > > big uninterpretable gap in my reports. I am looking for advice on the > > best strategy for combining two different tibbles with some overlap in > > observations (some patients appear in both data systems, with varying > > degrees of completeness of data) and with some of the same things being > > mesaured and recorded in the two data systems, but with different > > variable names. > > > > I've thought of two different strategies, neither of which seems ideal > > but either of which might work: > > > > 1. change the variable names in dataSystemB to match their > > conceptually-identical variables in dataSystemA, and then use some > > version of bind_rows() > > > > 2. Create a unique identifier from last names, first names, and dates of > > birth, use some type of full_join(), matching on that identifier, > > obtaining all columns from both tibbles, and then "collapse" > > conceptually-identical variables like onsetDate and date_of_onset using > > coalesce() > > > > Sorry for my long-windedness. Grateful for any advice. > > > > --Chris Ryan > > > > ______________________________________________ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.