Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Lingren, Todd Fri, 19 Jul 2019 07:27:50 -0700

Hi Masoud,

The replacement was consistent within a note, but not standardized across a 
patient's complete record. Date shifting was likewise applied within a note, 
not across a record. This doesn't really matter for the NER task, and even more 
extensive time-series information extraction or prediction shouldn't be relying 
on PHI anyway.
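
A minimal sketch of the within-note behaviour described above, assuming one 
scrubber object is created per note. The class name `NoteScrubber`, the small 
name pool, and the 1 to 365 day offset range are illustrative assumptions, not 
details of the published system.

```python
import random
from datetime import date, timedelta

# Hypothetical surrogate pool; a real system would sample from a much larger
# list, for example one weighted by census frequencies.
SURROGATE_FIRST_NAMES = ["Sally", "Maria", "Karen", "Linda", "Joan"]


class NoteScrubber:
    """Holds the surrogate choices for exactly one note."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._names = {}  # real name (lowercased) -> surrogate, this note only
        # One offset per note, so intervals inside the note are preserved.
        self._offset = timedelta(days=self._rng.randint(1, 365))

    def surrogate_name(self, real_name):
        """Return the same surrogate every time the same real name appears."""
        key = real_name.lower()
        if key not in self._names:
            unused = [n for n in SURROGATE_FIRST_NAMES
                      if n not in self._names.values() and n.lower() != key]
            # Fall back to the full pool if this tiny demo pool is exhausted.
            self._names[key] = self._rng.choice(unused or SURROGATE_FIRST_NAMES)
        return self._names[key]

    def shift_date(self, d):
        """Shift a date by this note's fixed offset."""
        return d + self._offset
```

Creating a fresh `NoteScrubber` for the next note gives a new mapping and a new 
offset, which is why nothing is standardized across the patient's record.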


One other point about addresses: we obfuscated the road type. For example, if 
the address said 123 Main Street, we would change it to 429 First Avenue, or 
something like that, and we wouldn't use Main Street (only Main 
Avenue/Road/Drive/Boulevard) in other replacements.
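
One way to read the address rule above, as a sketch: the surrogate always gets 
a different road type, and a real street name plus road type pair is never 
emitted as a surrogate. The class `AddressScrubber`, the street-name list, and 
the house-number range are hypothetical.

```python
import random

ROAD_TYPES = ["Street", "Avenue", "Road", "Drive", "Boulevard"]
# Hypothetical surrogate street names.
STREET_NAMES = ["First", "Oak", "Maple", "Main", "Lake"]


class AddressScrubber:
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._real = set()  # (street name, road type) pairs seen in real text

    def surrogate(self, street_name, road_type):
        """Return a fake 'number name type' address for one real address."""
        self._real.add((street_name.lower(), road_type.lower()))
        candidates = [
            (name, rtype)
            for name in STREET_NAMES
            for rtype in ROAD_TYPES
            # Always change the road type ...
            if rtype.lower() != road_type.lower()
            # ... and never emit a pair that occurred as a real address.
            and (name.lower(), rtype.lower()) not in self._real
        ]
        name, rtype = self._rng.choice(candidates)
        return f"{self._rng.randint(100, 999)} {name} {rtype}"
```

With this reading, "Main Avenue" remains a legal surrogate after "123 Main 
Street" has been seen, but "Main Street" does not.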



----------------------

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032


________________________________
From: Masoud Rouhizadeh <m...@jhu.edu>
Sent: Thursday, July 18, 2019 12:27:41 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Thanks, everyone, for your great feedback. Very helpful insights!

Here are a few comments and questions:

(1) Peter: great paper! I agree that replacing the same real person’s name with 
the same pseudonym makes the text easier to interpret, but on the other hand, 
wouldn't it make the de-identification less robust? I think if we pick a random 
pseudonym for each instance, it would be difficult to find the real name (in 
case the de-id system misses it) when it is surrounded by (lots of) 
pseudonyms.

(2) Peter: I'd appreciate if you could share your code. That would be helpful 
indeed.

(3) Todd: in your work, did you replace the same real person’s name with the 
same pseudonym across the note, or did you assign a random name each time?

(4) Date shifting can be complicated. In addition to the admission case that 
Peter pointed out, we would need to deal with consistency. Would shifting the 
dates by a random yet consistent number of days across a single note be 
sufficient, or should we do this at the patient level? For instance, if some 
signs and symptoms were observed and reported 1 year before the diagnosis, this 
trajectory should be preserved. Age would be another issue. Some risk factors 
are age-specific.

(5) Does anyone have any thoughts on using metadata from structured fields 
(e.g. name, DOB, SSN, contact info) to help the note de-identification system? 
If the note de-id system is aware of the person's real name, we could make it 
more sensitive to that name, or if we know the street on which the person 
lives, we can pay more attention to that in the free text. Just wondering if 
any de-id tool uses this information systematically?
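
For the patient-level option in (4), a minimal sketch, assuming a stable 
patient identifier and a site-held secret key are available (both are 
assumptions, as are the function names). The offset is derived from a keyed 
hash of the identifier, so every note for the same patient is shifted by the 
same number of days and intervals between events are preserved.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset(patient_id: str, secret: bytes, max_days: int = 365) -> timedelta:
    """Derive a fixed backward shift of 1..max_days days for one patient."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=-days)


def shift(d: date, patient_id: str, secret: bytes) -> date:
    """Shift any date in this patient's record by the patient's offset."""
    return d + patient_offset(patient_id, secret)
```

If the date of birth is shifted with the same offset, age at each event stays 
approximately the same. This does not address the holiday problem Peter 
described, where text such as "right after Christmas" anchors the real date.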

Thank you all!
Masoud

On 7/17/19, 3:01 PM, "Lingren, Todd" <todd.ling...@cchmc.org> wrote:

    We had some similar work on de-id and "re-id".

    The impact on performance for NER tasks was minimal.

    https://academic.oup.com/jamia/article/20/1/84/2909298

    The PHI replacement task used data based on US Census distributions.

    https://www.sciencedirect.com/science/article/pii/S1532046414000161




    ________________________________
    From: Peter Szolovits <p...@mit.edu>
    Sent: Wednesday, July 17, 2019 1:12:21 PM
    To: dev@ctakes.apache.org
    Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

    My group has done considerable work on de-identification and on 
synthesizing pseudonymous data to replace the original PHI with plausible but 
inauthentic data (sometimes confusingly called re-identification).

    One conclusion I reached from that work is that the de-identification and 
the pseudonym generation should be tightly coupled. For example, if de-id 
replaces all people’s names by [person], then there is no way in the pseudonym 
generation to make sure that the same real person’s name is replaced by the 
same pseudonym in every occurrence, leading to much harder to interpret text.  
The same goes for other PHI categories.

    I think it’s also important to keep similar formatting if the pseudonymized 
data are going to be used for NLP learning tasks.  So, for example, the format 
of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. 
Nicknames are a problem as well; if the same document also refers to Joe, and 
the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement 
for Joe should be Bob.  Gender is also tough because there are so many names 
that are either ambiguous or not in name dictionaries.
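
A sketch of the format-preservation point, assuming only two name layouts 
("Last, First M." and "First M. Last") and that the pseudonym parts have 
already been chosen elsewhere. The function `render_like` and both patterns are 
hypothetical, and nickname handling (Joe to Bob) is deliberately left out.

```python
import re

# "Smith, Joseph P." style
COMMA_FORM = re.compile(
    r"^(?P<last>[A-Za-z'-]+),\s*(?P<first>[A-Za-z'-]+)(?:\s+(?P<mid>[A-Z])\.)?$")
# "Joseph P. Smith" style
NATURAL_FORM = re.compile(
    r"^(?P<first>[A-Za-z'-]+)(?:\s+(?P<mid>[A-Z])\.)?\s+(?P<last>[A-Za-z'-]+)$")


def render_like(original, first, middle_initial, last):
    """Render the pseudonym parts in the same layout as the original name."""
    m = COMMA_FORM.match(original)
    if m:
        # Only emit a middle initial if the original had one.
        mid = f" {middle_initial}." if m.group("mid") else ""
        return f"{last}, {first}{mid}"
    m = NATURAL_FORM.match(original)
    if m:
        mid = f" {middle_initial}." if m.group("mid") else ""
        return f"{first}{mid} {last}"
    # Single token or unrecognized layout: fall back to the first name.
    # Mapping nicknames would need a separate lookup table.
    return first


print(render_like("Smith, Joseph P.", "Robert", "J", "Quincy"))  # Quincy, Robert J.
print(render_like("Joseph P. Smith", "Robert", "J", "Quincy"))   # Robert J. Quincy
```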

    Date shifting also introduces pseudonymization problems.  For example, a 
patient admitted on December 15 may have a note saying they are expected to be 
discharged right after Christmas. If the admission date is shifted, say to 
mid-January, then retaining the discharge expectation would imply a very long 
anticipated hospital stay.

    We published a paper on this topic:
    https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27

    I also have some old Java code that deals with a few of these issues, and 
would be happy to share it with anyone interested, though it’s far from 
production quality and does not address all the issues we know of.

    —Peter Szolovits

    > On Jul 17, 2019, at 12:42 PM, Finan, Sean 
<sean.fi...@childrens.harvard.edu> wrote:
    >
    > Hi All,
    >
    > ctakes-scrubber is not in any ctakes release and it is not in the main 
repository.  It never went beyond experimental and resides within the ctakes 
sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/
    >
    > From what I recall, scrubber does not have "real" name replacement, but 
instead de-identifies entities by removing them and inserting a tag indicating 
the type of entity.  For instance: "John has a rash" -> "[person] has a rash".  
 That is not verbatim, but it is the general idea.
    >
    > If you can get ctakes-scrubber working in your project then it would be 
pretty easy to create an engine that does nothing except replace such generic 
tags with random names, dates, institutions, etc.
    >
    > Sean
    > ________________________________________
    > From: gandhi rajan <gandhiraja...@gmail.com>
    > Sent: Wednesday, July 17, 2019 12:26 PM
    > To: dev@ctakes.apache.org
    > Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
    >
    > Hi Masoud, we had a similar requirement to identify patient names in
    > narrative text, and I had a discussion with Sean Finan about a patient name
    > identification feature in cTAKES. What he told me at the time was that
    > cTAKES didn't support patient name identification. Also, I'm not really
    > sure whether scrubber made it into the cTAKES codebase.
    >
    > Sean, please correct me if I'm wrong.
    >
    > On Wednesday, July 17, 2019, Masoud Rouhizadeh <m...@jhu.edu> wrote:
    >
    >> Dear cTAKES developers,
    >> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
    >> Institute for Clinical and Translational Research and work on
    >> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
    >> goals we are targeting is de-identification of a large number of notes
    >> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
    >> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
    >> has been very helpful.
    >>
    >> One of our most desired features in the de-identification pipeline is
    >> synthetic replacement (e.g. Nancy->Sally; random female first name
    >> consistently replaces a female first name.). I wasn't able to find
    >> information about this feature in cTAKES Scrubber. Is synthetic replacement
    >> functionality part of the cTAKES Scrubber, or can it be added by
    >> post-processing the output? For instance, if we know the name Nancy is
    >> removed from multiple places, can we use a name dictionary to insert random
    >> female first names in those places (just a thought)?
    >> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
    >> candidates and I'm hoping that we could find ways to collaborate.
    >>
    >> Thank you very much,
    >> Masoud
    >>
    >> ----
    >> Masoud Rouhizadeh, PhD
    >> Faculty - Division of Health Science Informatics (DHSI)
    >> NLP Lead - Institute for Clinical and Translational Research (ICTR)
    >> Johns Hopkins University School of Medicine
    >> https://www.cs.jhu.edu/~mrou/
    >>
    >>
    >
    > --
    > Regards,
    > Gandhi
    >
    > "The best way to find urself is to lose urself in the service of others 
!!!"
