We have done similar work on de-id and "re-id".

The impact on performance for NER tasks was minimal.

https://academic.oup.com/jamia/article/20/1/84/2909298

For the PHI replacement task, we used surrogate data based on U.S. Census distributions.

https://www.sciencedirect.com/science/article/pii/S1532046414000161



----------------------

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032


________________________________
From: Peter Szolovits <p...@mit.edu>
Sent: Wednesday, July 17, 2019 1:12:21 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

My group has done considerable work on de-identification and on synthesizing 
pseudonymous data to replace the original PHI with plausible but inauthentic 
data (sometimes confusingly called re-identification).

One conclusion I reached from that work is that the de-identification and the 
pseudonym generation should be tightly coupled. For example, if de-id replaces 
all people’s names by [person], then there is no way in the pseudonym 
generation to make sure that the same real person’s name is replaced by the 
same pseudonym in every occurrence, which makes the text much harder to interpret.
The same goes for other PHI categories.
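
A minimal sketch of that coupling, assuming the de-id step hands over character
offsets of detected person names rather than a generic [person] tag. The class
name, the function name, and the small surrogate pool are hypothetical and only
illustrate the idea; they are not taken from cTAKES or any published system.

```python
import random

# Hypothetical surrogate pool; a real system would use a large name dictionary.
SURROGATE_NAMES = ["Robert Quincy", "Alice Moreno", "David Lind", "Karen Osei"]


class PseudonymMapper:
    """Assigns one surrogate per distinct real name and reuses it afterwards."""

    def __init__(self, pool, seed=0):
        self._rng = random.Random(seed)
        self._available = list(pool)
        self._assigned = {}

    def surrogate_for(self, real_name):
        # Normalise case and surrounding whitespace so repeated mentions match.
        key = real_name.strip().lower()
        if key not in self._assigned:
            if not self._available:
                raise ValueError("surrogate pool exhausted")
            choice = self._rng.choice(self._available)
            self._available.remove(choice)  # keep surrogates distinct
            self._assigned[key] = choice
        return self._assigned[key]


def replace_spans(text, spans, mapper):
    """Replace each (start, end) name span with its consistent surrogate."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(mapper.surrogate_for(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Because the mapper still sees the original surface string, every mention of the
same person receives the same surrogate, which is not possible once the names
have been collapsed to [person].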

I think it’s also important to keep similar formatting if the pseudonymized 
data are going to be used for NLP learning tasks.  So, for example, the format 
of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. 
Nicknames are a problem as well; if the same document also refers to Joe, and 
the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement 
for Joe should be Bob.  Gender is also tough because there are so many names 
that are either ambiguous or not in name dictionaries.
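
As a rough illustration of format preservation, the sketch below renders a
surrogate in the same layout as the original mention. It assumes only two
layouts ("Last, First M." and "First M. Last"); the PersonName type, the
regular expressions, and render_like are hypothetical, and nickname and gender
handling are deliberately left out.

```python
import re
from dataclasses import dataclass


@dataclass
class PersonName:
    first: str
    middle_initial: str  # empty string when there is no middle initial
    last: str


# Assumed layouts only; real clinical text contains many more variants.
LAST_FIRST = re.compile(
    r"^(?P<last>[A-Za-z'-]+),\s+(?P<first>[A-Za-z'-]+)(?:\s+(?P<mi>[A-Z])\.?)?$"
)
FIRST_LAST = re.compile(
    r"^(?P<first>[A-Za-z'-]+)(?:\s+(?P<mi>[A-Z])\.?)?\s+(?P<last>[A-Za-z'-]+)$"
)


def _initial(match, surrogate):
    # Emit a middle initial only if the original mention had one.
    if match.group("mi") and surrogate.middle_initial:
        return f" {surrogate.middle_initial}."
    return ""


def render_like(original, surrogate):
    """Format the surrogate name using the layout of the original mention."""
    m = LAST_FIRST.match(original)
    if m:
        return f"{surrogate.last}, {surrogate.first}{_initial(m, surrogate)}"
    m = FIRST_LAST.match(original)
    if m:
        return f"{surrogate.first}{_initial(m, surrogate)} {surrogate.last}"
    # Unknown layout: fall back to a plain "First Last" rendering.
    return f"{surrogate.first} {surrogate.last}"
```

With the surrogate Robert J. Quincy, the mention "Smith, Joseph P." would be
rendered as "Quincy, Robert J." and "Joseph P. Smith" as "Robert J. Quincy".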

Date shifting also introduces pseudonymization problems.  For example, a 
patient admitted on December 15 may have a note saying they are expected to be 
discharged right after Christmas. If the admission date is shifted, say to 
mid-January, then retaining the discharge expectation would imply a very long 
anticipated hospital stay.
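
The sketch below illustrates the problem under one common assumption: each
patient gets a single constant offset, so intervals between explicit dates are
preserved. It cannot fix calendar-anchored phrases such as "after Christmas",
so it only flags them for review. The function names, the hashing scheme, and
the short holiday list are hypothetical.

```python
import hashlib
from datetime import date, timedelta

# Assumed, incomplete list of phrases tied to the real calendar.
HOLIDAY_TERMS = ("christmas", "thanksgiving", "new year", "easter")


def patient_offset(patient_id, secret, max_days=365):
    """Derive a stable per-patient shift of 1..max_days days from a secret."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=days)


def shift(d, offset):
    """Shift an explicit date by the patient's offset."""
    return d + offset


def flag_calendar_references(text):
    """Return the holiday terms found in the text; shifting cannot fix these."""
    lowered = text.lower()
    return [term for term in HOLIDAY_TERMS if term in lowered]
```

Shifting an admission date and a discharge date by the same offset keeps the
length of stay intact, while a note flagged by flag_calendar_references would
still need rewriting or manual review to stay consistent with the shifted dates.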

We published a paper on this topic:
https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27 

I also have some old Java code that deals with a few of these issues, and would 
be happy to share it with anyone interested, though it’s far from production 
quality and does not address all the issues we know of.

—Peter Szolovits

> On Jul 17, 2019, at 12:42 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
>
> Hi All,
>
> ctakes-scrubber is not in any ctakes release and it is not in the main 
> repository.  It never went beyond experimental and resides within the ctakes 
> sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/ 
>
> From what I recall, scrubber does not have "real" name replacement, but 
> instead de-identifies entities by removing them and inserting a tag 
> indicating the type of entity.  For instance: "John has a rash" -> "[person] 
> has a rash".   That is not verbatim, but it is the general idea.
>
> If you can get ctakes-scrubber working in your project then it would be 
> pretty easy to create an engine that does nothing except replace such generic 
> tags with random names, dates, institutions, etc.
>
> Sean
> ________________________________________
> From: gandhi rajan <gandhiraja...@gmail.com>
> Sent: Wednesday, July 17, 2019 12:26 PM
> To: dev@ctakes.apache.org
> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
>
> Hi Masoud, we had a similar requirement to identify patient names in
> narrative text, and I discussed a patient name identification feature in
> cTAKES with Sean Finan. He told me at the time that cTAKES did not support
> patient name identification. Also, I am not sure whether Scrubber ever made
> it into the cTAKES codebase.
>
> Sean, please correct me if I'm wrong.
>
> On Wednesday, July 17, 2019, Masoud Rouhizadeh <m...@jhu.edu> wrote:
>
>> Dear cTAKES developer,
>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
>> Institute for Clinical and Translational Research and work on
>> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
>> goals we are targeting is de-identification of a large number of notes
>> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
>> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
>> has been very helpful.
>>
>> One of our most desired features in the de-identification pipeline is
>> synthetic replacement (e.g., Nancy -> Sally, where a random female first
>> name consistently replaces a given female first name). I wasn't able to find
>> information about this feature in cTAKES Scrubber. Is synthetic replacement
>> functionality part of the cTAKES Scrubber, or can it be added by
>> post-processing the output? For instance, if we know the name Nancy is
>> removed from multiple places, can we use a name dictionary to insert random
>> female first names in those places (just a thought)?
>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
>> candidates and I'm hoping that we could find ways to collaborate.
>>
>> Thank you very much,
>> Masoud
>>
>> ----
>> Masoud Rouhizadeh, PhD
>> Faculty - Division of Health Science Informatics (DHSI)
>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
>> Johns Hopkins University School of Medicine
>> https://www.cs.jhu.edu/~mrou/
>>
>>
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others !!!"
