Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Peter Klügl Thu, 10 Mar 2016 11:57:41 -0800

Hi,

On 10.03.2016 at 20:29, Azad Dehghan wrote:

Thanks Peter,


The rules were modeled using the training data.

This means both training data folders? I have access to the data but not to the challenge description.

It would be good to incorporate/refactor the two-pass recognition method
for cTAKES (basically, the GATE API needs to be replaced with the UIMA API
to generate annotations), as it has a wider application on longitudinal data.
This method is used on top of a number of NERs.
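
A minimal sketch of the two-pass idea, for readers unfamiliar with it. It assumes that "two-pass" means: first collect the strings that the underlying NERs already annotated in a patient's documents, then re-scan all of that patient's documents for further occurrences of the same strings. That reading is an assumption and is not taken from TwoPass.java; the function name, the span tuples and the dictionary layout are hypothetical, and a real port would create UIMA annotations instead of tuples.

```python
import re


def two_pass(documents, first_pass_hits):
    """Propagate first-pass NER hits across a patient's documents.

    documents:       dict mapping doc_id -> text
    first_pass_hits: dict mapping doc_id -> list of (start, end, label)
    Returns a dict mapping doc_id -> sorted list of (start, end, label).
    """
    # Pass 1: collect the surface strings the NERs already found.
    lexicon = {}
    for doc_id, hits in first_pass_hits.items():
        text = documents[doc_id]
        for start, end, label in hits:
            # Keep the first label seen for a given surface string.
            lexicon.setdefault(text[start:end], label)

    # Pass 2: look for the same strings in every document.
    result = {}
    for doc_id, text in documents.items():
        found = set(first_pass_hits.get(doc_id, []))
        for surface, label in lexicon.items():
            # Lookarounds avoid matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                found.add((match.start(), match.end(), label))
        result[doc_id] = sorted(found)
    return result
```

With two notes for one patient, a name annotated only in the first note is also picked up in the second, which is why such a step is independent of the NERs that run before it.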


I'll take a look.

I do not know how much time I can invest this month. Let's see how many phases I can translate.

I added the rules for age. Are there JAPE rules for creating date annotations?

After all rules are translated, they need some major refactoring. JAPE and Ruta are quite different in some aspects.


Best,

Peter

Please let me know where I can help. I will be available again in April.

Cheers,
Azad

On 10 March 2016 at 13:13, Peter Klügl <[email protected]> wrote:

Hi,

sorry, I was quite busy last month.

I added a new patch, which needs to be applied.

No new rules, but it's possible now to evaluate everything against the
labelled data of the challenge.
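
A rough sketch of what such an evaluation computes, assuming exact-match scoring on (start, end, label) triples. The challenge's official scoring may use different matching criteria, and the function name and tuple layout here are hypothetical rather than taken from the patch.

```python
def evaluate(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```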

@Azad:
Which documents exactly did you use to develop the rules?
training-PHI-Gold-Set1, training-PHI-Gold-Set2 or testing-PHI-Gold-fixed?

Best,

Peter

On 03.02.2016 at 09:05, Peter Klügl wrote:

Hi,

the last patch fixed almost all problems.

I added another one that adds the csv file for the unit test and extends
svn-ignore.

Best,

Peter

On 02.02.2016 at 09:16, Peter Klügl wrote:

Hi,

I added another patch. I forgot to manually add one test file to version
control, and there are still duplicate lines.
I hope this patch fixes the remaining problems.

Best,

Peter


On 29.01.2016 at 10:34, Peter Klügl wrote:

Hi,

the problems were caused by the svn client in my Eclipse. Sorry for the
trouble, I should have looked more closely at the complete patch.

I attached a new patch created with command-line tools, which looks
correct now.

Pei, can you apply the new patch?

Best,

Peter

On 28.01.2016 at 15:57, Peter Klügl wrote:

Thanks Pei.

I fear there was again a problem with the patch. All new files are
missing (and also the svn-ignore settings).

Can you take a look?

Best,

Peter

On 28.01.2016 at 14:43, Pei Chen wrote:

patch applied.
Thanks,
Pei

On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <[email protected]> wrote:

Hi Pei,

can you commit the recent patch for us?

CTAKES-384-20160120.patch

Best,

Peter

On 20.01.2016 at 19:35, Pei Chen wrote:

Hi,
Sorry I was swamped recently.
But yeah, we can even create an extended type system to store these items
temporarily and add them into the main/core type system afterwards.

There was an existing item to upgrade UIMA, but agreed, it will require
much more testing.  If it works, we can upgrade it in our sandbox area or
create a branch if necessary.

—Pei

On Jan 18, 2016, at 9:06 AM, Peter Klügl <[email protected]> wrote:

Hi,

a new patch is attached.

@Pei:
are there suitable annotation types in the cTAKES type system? Some
project in cTAKES uses something like OntologyMatch... I map it to
IdentifiedAnnotation right now, but there are many empty features...

@Azad:
I changed the rules a bit, especially the capitalization, to match how I
normally use it in Ruta. The word lists are compiled to a trie by the maven
plugin. I also added the two regexes for url and email, and I extended the
regex for the url. I also changed the evaluation order of some rules
(with @). Feel free to add simple examples to examples.csv for the unit
tests.
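
To illustrate why a trie suits word-list lookup, here is a small sketch of building a character trie from a word list and finding the longest entry at each position. It is not how the Ruta maven plugin compiles or stores its tries; the function names and the `"$end"` marker are hypothetical, and matching is case-insensitive only for simplicity.

```python
def build_trie(words):
    """Build a nested-dict character trie; "$end" marks a complete entry."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def find_matches(text, trie):
    """Return (start, end) spans of the longest word-list entries in text.

    Matches are non-overlapping, found left to right, and must start and
    end at a word boundary. Offsets assume lower() keeps the text length,
    which holds for ASCII input.
    """
    lowered = text.lower()
    matches = []
    i = 0
    while i < len(lowered):
        # Only start a match at the beginning of a word.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j, last_end = trie, i, None
        while j < len(lowered) and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            at_boundary = j == len(lowered) or not lowered[j].isalnum()
            if "$end" in node and at_boundary:
                last_end = j  # remember the longest complete entry so far
        if last_end is not None:
            matches.append((i, last_end))
            i = last_end
        else:
            i += 1
    return matches
```

Because entries that share a prefix share a path, "New York" and "New York City" are checked in a single walk and the longer entry wins.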

Let me know if you need more information about the changes.

Would you like help with the other rule sets? Or should we split them up?

Best,

Peter

On 18.01.2016 at 11:04, Peter Klügl wrote:

Hi,

great. I will integrate them in the project and in the next patch.

Best,

Peter

On 18.01.2016 at 00:58, Azad Dehghan wrote:

Three NERs translated and uploaded.

PS. I will validate all NERs once we have them all completed.

Cheers,
Azad

On 24 November 2015 at 10:37, Azad Dehghan <[email protected]> wrote:

This is on my todo list for Dec. as well. If there are any more volunteers
for translating JAPE to RUTA, please get in touch.

Cheers,
Azad

On 24 Nov 2015 09:55, "Peter Klügl" <[email protected]> wrote:

Hi,

I just wanted to mention that I haven't forgotten about it. Unfortunately,
there is just no spare time right now. I hope I will be able to provide
the patches in December.

Best,

Peter

On 06.11.2015 at 16:40, Pei Chen wrote:

Hi Peter,
I think the ctakes-examples is probably a good starting point, at least
in terms of maven modules, etc.  I think it would be good if we use
uimaFIT style as the primary approach to wiring components together and
generate desc's as secondary...
I think the actual components that would be required are probably best
left up to what is actually required for the best-performing c-deid.  The
output would be interesting; I'm not sure if we should treat this as
an independent preprocessing component or part of a pipeline (in which
case, we may need to propose a change to the type system or perhaps an
alternative JCas view.  You can probably open up that discussion to
the dev group as you see fit.)

My 2 cents...


On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <[email protected]> wrote:

Hi,

Is there a cTAKES project that may serve as an example of how the cTAKES
community develops or what a project should look like?
I learned that different people set up UIMA projects in quite different
ways, and I do not want to be inspired by "some sort of outdated"
approach in the cTAKES repo.

Are there restrictions or preferences regarding the preprocessing components
that should be used and the kind of "output" of the project?
Components: On which components may the components rely: tokenizer, ...
parser, ... dict lookup?
"output": Should the project provide a pipeline or a single AE?

More comments below.

On 03.11.2015 at 16:54, Azad Dehghan wrote:

Who else plans to provide patches for it? Just to avoid duplicate work
and to coordinate the efforts ...

I would like to help with translating JAPE to RUTA.

You can already go ahead with the UIMA Ruta Workbench if you want, or
wait until I set up the project with ruta integration.

If any questions arise, just ask :-)

Is there a development dataset which was utilized for the initial
development, and if yes, is it possible to contribute it too?

The data set is unfortunately not publicly available; i2b2
<https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
sets 12 months after a given challenge; this is done on an individual basis
and involves a Data Use Agreement.

However, I will be able to conduct and coordinate the

validation.

Ok, I'll investigate whether we already have access to the dataset here.

My first step would be:
- set up a maven project
- set up a development pipeline in a test (with cTAKES components
  replacing the previous ANNIE preprocessing)


But one item that we need to review is the 3rd party libs jars that
were included to ensure compatibility.  I’ll be sure to take a look at
that over the next few weeks.

—Pei

@Pei - once the ANNIE components are replaced there should not be a need to
worry about the 3rd party libs.

Also, just a thought: we may want to create an independent component for
the Two Pass recognition (TwoPass.java), as this method has proven useful
for general NER on longitudinal data and is surely useful independent of the
deid component.


Cheers,
Azad
