I would like to conduct a survival analysis, examining a subject's time to *next* appearance in a database, after their first appearance. It is a database of dated events.
I need to obfuscate, anonymize, or mask the subject identifiers (a combination of name and birthdate). Obviously, any given subject should have the same anonymous code every time he/she appears in the database. I'm not talking "safe from the NSA" here, and I won't be releasing the data. It's just sensitive data, and I don't want to be working every day with cleartext versions of it.

I've looked at the packages digest, anonymizer, and anonymize. What do you think of this approach?

# running R 3.1.1 on Windows 7 Enterprise
library(digest)

dd <- data.frame(id = 1:6,
                 name = c("Harry", "Ron", "Hermione", "Luna", "Ginny", "Harry"),
                 dob = c("1990-01-01", "1990-06-15", "1990-04-08",
                         "1999-11-26", "1990-07-21", "1990-01-01"))

# build one key per subject from the lower-cased name and the date of birth
dd.2 <- transform(dd, code = paste0(tolower(name), tolower(dob)))

anonymize <- function(x, algo = "sha256") {
  # transform() returns a factor under the default stringsAsFactors = TRUE;
  # coerce to character so the lookup below is by name, not by factor code
  x <- as.character(x)
  unq_hashes <- vapply(x, function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  unname(unq_hashes[x])
}

dd.2$codex <- anonymize(dd.2$code)
dd.2
table(duplicated(dd.2$codex))

Thanks.

--Chris Ryan
Broome County Health Department
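For comparison, here is a minimal sketch of a variant of the same idea, not a vetted solution. It assumes the identifiers can be treated as plain character strings. Each distinct value is hashed once with digest(..., serialize = FALSE), so the SHA-256 is computed over the string itself rather than over R's serialized representation of the object, and the hashes are mapped back onto the full vector with match(). The helper name pseudonymize is hypothetical and does not come from any package.

```r
library(digest)

# Hypothetical helper (not part of digest or any other package):
# hash each distinct identifier once, then map the hashes back
# onto the full vector so repeated subjects share one code.
pseudonymize <- function(x, algo = "sha256") {
  x <- as.character(x)   # guard against factors
  ux <- unique(x)
  hashes <- vapply(ux, digest, FUN.VALUE = character(1),
                   algo = algo, serialize = FALSE, USE.NAMES = FALSE)
  hashes[match(x, ux)]
}

dd <- data.frame(id = 1:6,
                 name = c("Harry", "Ron", "Hermione", "Luna", "Ginny", "Harry"),
                 dob = c("1990-01-01", "1990-06-15", "1990-04-08",
                         "1999-11-26", "1990-07-21", "1990-01-01"),
                 stringsAsFactors = FALSE)

# one key per subject: lower-cased name followed by date of birth
dd$codex <- pseudonymize(paste0(tolower(dd$name), dd$dob))

# rows 1 and 6 are the same subject, so exactly one code is a repeat
table(duplicated(dd$codex))
```

Because names and birthdates are guessable, an unkeyed hash like this only hides the cleartext from casual viewing; anyone who can enumerate plausible name and birthdate pairs could recompute the codes.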