RE: Ctakes to process 5000K records

Finan, Sean Wed, 10 Sep 2014 13:15:34 -0700

Hi Nick, 

>file:org/apache/ctakes/dictionary/fast/cTakesHsql.xml

Does that file not exist under resources?  cTakes shouldn't need anything under 
that directory to be added to the classpath.

I checked the source into trunk this morning, but the zip that you downloaded 
had everything included.  As long as you unzipped it in the cTakes root, the 
resources, desc and lib directories should have been properly placed.

Sean

________________________________________
From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Wednesday, September 10, 2014 3:06 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K records

Hi Sean,

I am getting this error:
org.apache.uima.resource.ResourceInitializationException: Could not access the 
resource data at file:org/apache/ctakes/dictionary/fast/cTakesHsql.xml.

Where should I add it to the classpath?

Thanks,
Nick

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:39 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Just use it with cTakes.  Instead of removing other modules from the pipeline, 
replace the dictionary-lookup with dictionary-lookup-fast.

For the descriptor
desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml,
you would modify:

    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
      <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
    </delegateAnalysisEngine>

To be:

    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
      <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
    </delegateAnalysisEngine>

That should be it.  You can then leave the rest of the module specifications 
alone.
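
For anyone who has to apply this change to more than one descriptor, the swap described above can be sketched as a plain text replacement. This is only a sketch: the helper name swap_lookup_import and the SAMPLE string are made up for illustration, and it assumes the old import path appears in the descriptor exactly as quoted above.

```python
# Import locations copied from the message above.
OLD_IMPORT = (
    "../../../ctakes-dictionary-lookup/desc/analysis_engine/"
    "DictionaryLookupAnnotatorUMLS.xml"
)
NEW_IMPORT = (
    "../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/"
    "UmlsLookupAnnotator.xml"
)


def swap_lookup_import(descriptor_text: str) -> str:
    """Return the descriptor text with the old lookup import replaced.

    Hypothetical helper: it does a literal string replacement and refuses
    to continue if the old import is not present, so a descriptor that was
    already edited (or laid out differently) is not silently left as-is.
    """
    if OLD_IMPORT not in descriptor_text:
        raise ValueError("old dictionary-lookup import not found")
    return descriptor_text.replace(OLD_IMPORT, NEW_IMPORT)


# Illustrative fragment only; a real aggregate descriptor contains much more.
SAMPLE = (
    '<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">\n'
    '  <import location="' + OLD_IMPORT + '"/>\n'
    "</delegateAnalysisEngine>"
)

print(swap_lookup_import(SAMPLE))
```

To use it on a real file, read the descriptor's text, pass it through the function, and write the result back, keeping a backup copy of the original descriptor.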

Sean

________________________________________
From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions for 
running that script, or do I have to use it with cTakes?

Thanks,
Nick

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean
________________________________________
From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 <node>ContextDependentTokenizerAnnotator</node>
 <node>DependencyParser</node>
 <node>AssertionAnnotator</node>

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.
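
The trimming suggested above can be sketched as a filter over the fixed flow listed in the original message. This is a rough illustration, not cTakes code: trim_flow is a hypothetical helper, and it only computes the list of node names. The aggregate descriptor itself would still need to be edited to match.

```python
# Fixed flow as listed in the original message of this thread.
FIXED_FLOW = [
    "SimpleSegmentAnnotator",
    "SentenceDetectorAnnotator",
    "TokenizerAnnotator",
    "LvgAnnotator",
    "ContextDependentTokenizerAnnotator",
    "POSTagger",
    "Chunker",
    "LookupWindowAnnotator",
    "DictionaryLookupAnnotatorDB",
    "DependencyParser",
    "AssertionAnnotator",
    "ExtractionPrepAnnotator",
]

# Nodes suggested above as removable when only medication names are needed.
NOT_NEEDED = {
    "ContextDependentTokenizerAnnotator",
    "DependencyParser",
    "AssertionAnnotator",
}


def trim_flow(flow, drop_lvg=False):
    """Return the flow without the removable nodes, keeping the order.

    drop_lvg=True also removes LvgAnnotator, which is the variant to
    compare against on a smaller set of files.
    """
    drop = set(NOT_NEEDED)
    if drop_lvg:
        drop.add("LvgAnnotator")
    return [node for node in flow if node not in drop]


for node in trim_flow(FIXED_FLOW):
    print(f"<node>{node}</node>")
```

Running it once with drop_lvg=False and once with drop_lvg=True gives the two flows to compare on a small sample before committing to the full run.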

-----Original Message-----
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestions for running cTakes with the minimum set of 
annotators that can still return medications in DictionaryLookupAnnotator?
Thanks,
Nick

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the SimpleSegmentAnnotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator - so that there is at least one section/segment identified. 
Are you getting any annotations at all?
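
A toy model of that behaviour, to make the point concrete. None of this is cTakes code: Document, simple_segment and lookup are hypothetical stand-ins, and the only thing taken from this thread is the idea that downstream annotators iterate over segments and that the simple segment annotator marks the whole document as one segment named SIMPLE_SEGMENT.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document and its segment annotations."""
    text: str
    segments: list = field(default_factory=list)  # (segment_id, begin, end)


def simple_segment(doc):
    """Mark the entire document as a single segment."""
    doc.segments.append(("SIMPLE_SEGMENT", 0, len(doc.text)))


def lookup(doc, dictionary):
    """Look up dictionary terms, but only inside known segments."""
    hits = []
    for _segment_id, begin, end in doc.segments:
        for word in doc.text[begin:end].lower().split():
            token = word.strip(".,;")
            if token in dictionary:
                hits.append(token)
    return hits


MEDS = {"aspirin", "metformin"}  # illustrative dictionary

without_segments = Document("Patient takes aspirin and metformin daily.")
with_segments = Document("Patient takes aspirin and metformin daily.")
simple_segment(with_segments)

print(lookup(without_segments, MEDS))  # no segments, so the loop never runs
print(lookup(with_segments, MEDS))
```

With no segment, the outer loop has nothing to iterate over, so the lookup returns immediately with no hits. That would look like a very fast run that produces no annotations.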

-----Original Message-----
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for an application that I wrote that uses 
cTakes, so I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program. But when I have 
SimpleSegmentAnnotator in the flow, it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication names are returned in 
DictionaryLookupAnnotator. So I was wondering if there is a way that I could 
eliminate SimpleSegmentAnnotator but still be able to get the medication 
names in that class?

Nick

-----Original Message-----
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,
When you say no medication is being annotated, I presume you mean the 
medication attributes (i.e. dosage, frequency, etc.) are not being annotated?  
I think the DrugNER needs a list of section names in the config; I think it 
includes SIMPLE_SEGMENT.  I am very surprised that SimpleSegmentAnnotator is 
the bottleneck though; all it does is assume the entire document is a single 
section called SIMPLE_SEGMENT.
Have you tried commenting out the DependencyParser, if you're not using those 
features?

--Pei

On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <snika...@emerginghealthit.com> 
wrote:
>
> Hi there,
>
> I am using cTakes to process 5000K free-text records, where each record has 
> several medications.
> This is the fixed flow that it goes through:
>
>     <node>SimpleSegmentAnnotator</node>
>     <node>SentenceDetectorAnnotator</node>
>     <node>TokenizerAnnotator</node>
>     <node>LvgAnnotator</node>
>     <node>ContextDependentTokenizerAnnotator</node>
>     <node>POSTagger</node>
>     <node>Chunker</node>
>     <node>LookupWindowAnnotator</node>
>     <node>DictionaryLookupAnnotatorDB</node>
>     <node>DependencyParser</node>
>     <node>AssertionAnnotator</node>
>     <node>ExtractionPrepAnnotator</node>
>
> But it takes a very long time to process that much data (maybe a week or 
> so) when I use SimpleSegmentAnnotator.  By eliminating SimpleSegmentAnnotator, 
> the process is very fast, but no medication is being annotated.  Do you have 
> any suggestions?
>
> Thanks,
> Nick
>
