We had some similar work on de-id and "re-id". The impact on performance for NER tasks was minimal.
https://academic.oup.com/jamia/article/20/1/84/2909298

The PHI replacement task used data based on the US Census distribution.
https://www.sciencedirect.com/science/article/pii/S1532046414000161

----------------------
Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032

________________________________
From: Peter Szolovits <p...@mit.edu>
Sent: Wednesday, July 17, 2019 1:12:21 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

My group has done considerable work on de-identification and on synthesizing pseudonymous data to replace the original PHI with plausible but inauthentic data (sometimes confusingly called re-identification).

One conclusion I reached from that work is that the de-identification and the pseudonym generation should be tightly coupled. For example, if de-id replaces all people’s names by [person], then there is no way in the pseudonym generation to make sure that the same real person’s name is replaced by the same pseudonym in every occurrence, leading to text that is much harder to interpret. The same goes for other PHI categories.

I think it’s also important to keep similar formatting if the pseudonymized data are going to be used for NLP learning tasks. So, for example, the format of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. Nicknames are a problem as well; if the same document also refers to Joe, and the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement for Joe should be Bob. Gender is also tough because there are so many names that are either ambiguous or not in name dictionaries.

Date shifting also introduces pseudonymization problems. For example, a patient admitted on December 15 may have a note saying they are expected to be discharged right after Christmas.
If the admission date is shifted, say to mid-January, then retaining the discharge expectation would imply a very long anticipated hospital stay.

We published a paper on this topic: https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27

I also have some old Java code that deals with a few of these issues, and would be happy to share it with anyone interested, though it’s far from production quality and does not address all the issues we know of.

--Peter Szolovits

> On Jul 17, 2019, at 12:42 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>
> Hi All,
>
> ctakes-scrubber is not in any ctakes release and it is not in the main repository. It never went beyond experimental and resides within the ctakes sandbox. https://svn.apache.org/repos/asf/ctakes/sandbox/
>
> From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity. For instance: "John has a rash" -> "[person] has a rash". That is not verbatim, but it is the general idea.
>
> If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
>
> Sean
> ________________________________________
> From: gandhi rajan <gandhiraja...@gmail.com>
> Sent: Wednesday, July 17, 2019 12:26 PM
> To: dev@ctakes.apache.org
> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
>
> Hi Masoud, we had a similar requirement to identify patient names in narrative text, and I had a discussion with Sean Finan about a patient name identification feature in cTAKES. What he told me at that point was that cTAKES didn't support patient name identification.
> Also, I'm not sure whether scrubber made it into the cTAKES codebase.
>
> Sean, please correct me if I'm wrong.
>
> On Wednesday, July 17, 2019, Masoud Rouhizadeh <m...@jhu.edu> wrote:
>
>> Dear cTAKES developer,
>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the Institute for Clinical and Translational Research and work on enterprise-level NLP projects at Johns Hopkins Medicine. One of the major goals we are targeting is de-identification of a large number of notes (350M) to prepare them for search and indexing (Elasticsearch and Solr). I have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she has been very helpful.
>>
>> One of our most desired features in the de-identification pipeline is synthetic replacement (e.g. Nancy->Sally; a random female first name consistently replaces a female first name). I wasn't able to find information about this feature in cTAKES Scrubber. Is synthetic replacement functionality part of the cTAKES Scrubber, or can it be added by post-processing the output? For instance, if we know the name Nancy is removed from multiple places, can we use a name dictionary to insert random female first names in those places (just a thought)?
>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main candidates and I'm hoping that we could find ways to collaborate.
>>
>> Thank you very much,
>> Masoud
>>
>> ----
>> Masoud Rouhizadeh, PhD
>> Faculty - Division of Health Science Informatics (DHSI)
>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
>> Johns Hopkins University School of Medicine
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others !!!"
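
For anyone following this thread, here is a minimal Python sketch of the consistency requirement discussed above: the same real name should map to the same surrogate everywhere (e.g. Nancy->Sally), which is only possible if the replacement step still sees the original text. This is not cTAKES Scrubber code. The class `PseudonymMapper`, the function `pseudonymize`, the tiny name pools, and the `(start, end, gender)` span format are all hypothetical, chosen only for illustration; the spans are assumed to come from some upstream de-identification step and to be non-overlapping. The sketch does not attempt the harder issues raised in the thread (nicknames, name formatting, ambiguous gender, date shifting).

```python
import random

# Hypothetical surrogate pools; a real system would use a far larger
# dictionary (for example, one built from census name distributions).
FEMALE_NAMES = ["Sally", "Maria", "Helen", "Grace"]
MALE_NAMES = ["Robert", "David", "Carl", "Owen"]


class PseudonymMapper:
    """Keeps one stable surrogate per real first name."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)  # seeded so runs are repeatable
        self._mapping = {}

    def surrogate(self, name, gender):
        key = name.lower()  # "Nancy" and "nancy" share one surrogate
        if key not in self._mapping:
            pool = FEMALE_NAMES if gender == "F" else MALE_NAMES
            # Never hand back the real name as its own surrogate.
            candidates = [n for n in pool if n.lower() != key]
            self._mapping[key] = self._rng.choice(candidates)
        return self._mapping[key]


def pseudonymize(text, spans, mapper):
    """Replace each (start, end, gender) span with its surrogate.

    Spans are character offsets into `text`, assumed non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end, gender in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(mapper.surrogate(text[start:end], gender))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    note = "Nancy has a rash. Nancy was seen by John."
    spans = [(0, 5, "F"), (18, 23, "F"), (36, 40, "M")]
    print(pseudonymize(note, spans, PseudonymMapper(seed=1)))
```

Because the mapping is keyed on the original name, both occurrences of "Nancy" receive the same surrogate. If the input were already reduced to "[person]" tags, this lookup would have nothing to key on, which is the coupling problem described in the thread.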