Re: cTAKES to process 5000K records

Chen, Pei Tue, 09 Sep 2014 14:27:06 -0700

Sean-
Aren't the scripts to generate the DB already available in the sandbox area?



> On Sep 9, 2014, at 5:24 PM, "Finan, Sean" <sean.fi...@childrens.harvard.edu> 
> wrote:
> 
> There is a tool to generate a dictionary in the new format using the UMLS 
> MR*** files.  
> 
> The module can also read directly from a file with bar-separated values:  
> CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
> 
> I can send a copy of the dictionary creator jar and instructions tomorrow.
> 
> Sean
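
For anyone who wants to sanity-check a small custom dictionary before loading it, here is a minimal sketch of reading the bar-separated layout described above (CUI|Text or CUI|TUI|Text). It is not the cTAKES reader; the names DictEntry, parse_bsv_line and load_bsv are hypothetical, and the sample rows are for illustration only.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    cui: str
    tui: Optional[str]  # None for two-column CUI|Text lines
    text: str


def parse_bsv_line(line: str) -> Optional[DictEntry]:
    """Parse one 'CUI|Text' or 'CUI|TUI|Text' line; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    fields = [f.strip() for f in line.split("|")]
    if len(fields) == 2:
        cui, text = fields
        return DictEntry(cui, None, text)
    if len(fields) == 3:
        cui, tui, text = fields
        return DictEntry(cui, tui, text)
    raise ValueError(
        f"expected 2 or 3 bar-separated fields, got {len(fields)}: {line!r}"
    )


def load_bsv(lines):
    """Collect entries from an iterable of lines, skipping blank ones."""
    entries = []
    for line in lines:
        entry = parse_bsv_line(line)
        if entry is not None:
            entries.append(entry)
    return entries


# Illustrative rows only (assumed values, not taken from the thread).
sample = ["C0004057|aspirin", "C0020740|T121|ibuprofen", ""]
for entry in load_bsv(sample):
    print(entry)
```

Rejecting lines with an unexpected number of fields up front makes a stray "|" inside a term show up as an error instead of a silently wrong entry.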
> ________________________________________
> From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
> Sent: Tuesday, September 09, 2014 5:17 PM
> To: dev@ctakes.apache.org
> Subject: Re: Ctakes to process 5000K recoreds
> 
> Sean,
> 
> If that is a script for generating a dictionary for use with
> dictionary-lookup-fast, I would also be very interested in checking it out.
> 
> Thanks,
> 
> Bruce
> 
> 
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
> 
> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
> snika...@emerginghealthit.com> wrote:
> 
>> Great. I will do that. Thanks again.
>> 
>> Nick
>> 
>> -----Original Message-----
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:39 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> Just use it with cTakes.  Instead of removing other modules from the
>> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>> 
>> In
>> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml,
>> you would change:
>> 
>>    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>      <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>>    </delegateAnalysisEngine>
>> 
>> To be:
>> 
>>    <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>      <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>>    </delegateAnalysisEngine>
>> 
>> 
>> That should be it.  You can then leave the rest of the module
>> specifications alone.
>> 
>> Sean
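
If the same swap has to be applied to more than one descriptor, it could be scripted. The following is only a sketch using Python's standard xml.etree.ElementTree; swap_lookup_import is a hypothetical helper, and the second delegate in the sample fragment (with its location) is made up for the demonstration. Only the key and the two import locations come from the message above.

```python
import xml.etree.ElementTree as ET

OLD_LOCATION = (
    "../../../ctakes-dictionary-lookup/desc/analysis_engine/"
    "DictionaryLookupAnnotatorUMLS.xml"
)
NEW_LOCATION = (
    "../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/"
    "UmlsLookupAnnotator.xml"
)


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def swap_lookup_import(xml_text, key="DictionaryLookupAnnotatorDB",
                       new_location=NEW_LOCATION):
    """Point the <import> of the delegate with the given key at new_location.

    Returns (modified_xml_text, number_of_imports_changed).
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for elem in root.iter():
        if local_name(elem.tag) != "delegateAnalysisEngine":
            continue
        if elem.get("key") != key:
            continue
        for child in elem:
            if local_name(child.tag) == "import":
                child.set("location", new_location)
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Demo fragment: the Chunker delegate and its location are assumptions.
FRAGMENT = f"""<delegateAnalysisEngineSpecifiers>
  <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
    <import location="{OLD_LOCATION}"/>
  </delegateAnalysisEngine>
  <delegateAnalysisEngine key="Chunker">
    <import location="Chunker.xml"/>
  </delegateAnalysisEngine>
</delegateAnalysisEngineSpecifiers>"""

updated, count = swap_lookup_import(FRAGMENT)
print(count)
print(updated)
```

Matching on the delegate key rather than on the old path means the other delegates are left alone even if their locations look similar. ElementTree may rewrite namespace prefixes when it serializes a full descriptor, so it is worth diffing the result against the original before replacing the file.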
>> 
>> ________________________________________
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:32 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> Hi Sean,
>> 
>> Many thanks, I will try it tomorrow. Do you have any special instructions
>> for running that script, or do I have to use it with cTAKES?
>> 
>> Thanks,
>> Nick
>> 
>> -----Original Message-----
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:24 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> Hi Nick,
>> 
>> I think that the bottleneck is probably the lookup module itself.  So, I
>> just sent you a secure email/ftp link.  It contains a build of the new
>> dictionary-lookup-fast module.  Should you choose to try it, let me know
>> how things turn out.
>> 
>> Sean
>> ________________________________________
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:10 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> Thanks, let me try it.
>> Nick
>> 
>> -----Original Message-----
>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>> Sent: Tuesday, September 09, 2014 4:08 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> If you just need the medication names, you can remove these:
>> <node>ContextDependentTokenizerAnnotator</node>
>> <node>DependencyParser</node>
>> <node>AssertionAnnotator</node>
>> 
>> You might be able to get rid of the LvgAnnotator and still get decent
>> results since variations of word form should not affect medication names. I
>> would try with it and without it on a smaller set of files and see if you
>> see a difference.
>> 
>> I believe the others are needed by the default configs for medication
>> lookup. For example, POS tags are used to get the phrase type, and phrases
>> are used to exclude verb phrases from the lookup, which in turn keeps the
>> lookup windows from getting too big.  I'm more familiar with the other types
>> of named entities (diseases, symptoms, etc.) than with medications.
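
The with/without comparison suggested above can be sketched roughly as follows. This does not run cTAKES; it only derives the two trimmed node lists from the flow posted at the start of the thread and diffs per-document mention sets that are assumed to have been collected from two separate runs. prune_flow and compare_runs are hypothetical helper names, and the sample mentions are invented.

```python
# Node order copied from the fixed flow in the original message.
BASE_FLOW = [
    "SimpleSegmentAnnotator",
    "SentenceDetectorAnnotator",
    "TokenizerAnnotator",
    "LvgAnnotator",
    "ContextDependentTokenizerAnnotator",
    "POSTagger",
    "Chunker",
    "LookupWindowAnnotator",
    "DictionaryLookupAnnotatorDB",
    "DependencyParser",
    "AssertionAnnotator",
    "ExtractionPrepAnnotator",
]

# Nodes named above as removable when only medication names are needed.
NOT_NEEDED_FOR_NAMES = {
    "ContextDependentTokenizerAnnotator",
    "DependencyParser",
    "AssertionAnnotator",
}


def prune_flow(flow, to_remove):
    """Return the flow without the given nodes, keeping the original order."""
    return [node for node in flow if node not in to_remove]


def compare_runs(with_lvg, without_lvg):
    """Diff two {doc_id: set_of_mentions} results.

    Returns {doc_id: (only_with_lvg, only_without_lvg)} for documents that differ.
    """
    diffs = {}
    for doc_id in sorted(set(with_lvg) | set(without_lvg)):
        a = with_lvg.get(doc_id, set())
        b = without_lvg.get(doc_id, set())
        if a != b:
            diffs[doc_id] = (a - b, b - a)
    return diffs


trimmed = prune_flow(BASE_FLOW, NOT_NEEDED_FOR_NAMES)
trimmed_no_lvg = prune_flow(trimmed, {"LvgAnnotator"})

# Invented results standing in for the output of two small test runs.
run_with_lvg = {"doc1": {"aspirin"}, "doc2": {"tablets"}}
run_without_lvg = {"doc1": {"aspirin"}, "doc2": set()}
print(compare_runs(run_with_lvg, run_without_lvg))
```

An empty diff on a representative sample would suggest LvgAnnotator can be dropped for this use; any differing documents show exactly which mentions depend on it.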
>> 
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 3:01 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> James,
>> 
>> Do you have any suggestions for running cTAKES with the minimum set of
>> annotators that still returns medications in DictionaryLookupAnnotator?
>> Thanks,
>> Nick
>> 
>> -----Original Message-----
>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>> Sent: Tuesday, September 09, 2014 3:05 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> I suspect that when you take out the SimpleSegmentAnnotator, nothing is
>> getting processed, and that is why it appears so fast. At least some of the
>> annotators loop through the list of sections/segments, which is why there
>> is a simple segment annotator: so that at least one section/segment is
>> identified. Are you getting any annotations at all?
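
The effect described here can be shown with a toy model. This is not cTAKES code; Document, simple_segment_annotator and toy_lookup_annotator are hypothetical stand-ins meant only to show why a downstream annotator that iterates over segments does no work, and so finishes very quickly, when no segment was ever created.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    segments: list = field(default_factory=list)  # (segment_id, begin, end)
    mentions: list = field(default_factory=list)


def simple_segment_annotator(doc):
    """Treat the whole document as one segment called SIMPLE_SEGMENT."""
    doc.segments.append(("SIMPLE_SEGMENT", 0, len(doc.text)))


def toy_lookup_annotator(doc, dictionary):
    """Look up dictionary terms, but only inside existing segments."""
    for _segment_id, begin, end in doc.segments:
        segment_text = doc.text[begin:end].lower()
        for term in dictionary:
            if term in segment_text:
                doc.mentions.append(term)


def run(text, use_segmenter):
    doc = Document(text)
    if use_segmenter:
        simple_segment_annotator(doc)
    toy_lookup_annotator(doc, {"aspirin", "ibuprofen"})
    return sorted(doc.mentions)


print(run("Aspirin 81 mg daily", use_segmenter=True))   # ['aspirin']
print(run("Aspirin 81 mg daily", use_segmenter=False))  # []
```

With no segments the loop body never executes, so the run is fast and the result is empty, which matches the symptoms reported in the thread.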
>> 
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 2:02 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>> 
>> Pei,
>> I need the names of the medications for an application I wrote that
>> uses cTAKES, so I cache the medications in DictionaryLookupAnnotator (in
>> performLookup()) and use them in my program. With
>> SimpleSegmentAnnotator in the pipeline it just takes forever. After taking
>> SimpleSegmentAnnotator out, no medication names are returned from
>> DictionaryLookupAnnotator in the code. So I was wondering whether
>> there is a way to eliminate SimpleSegmentAnnotator and still
>> get the medication names in that class?
>> 
>> Nick
>> 
>> -----Original Message-----
>> From: Pei Chen [mailto:chen...@apache.org]
>> Sent: Tuesday, September 09, 2014 2:54 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Ctakes to process 5000K recoreds
>> 
>> Nick,
>> When you say no medication is being annotated, I presume you mean the
>> medication attributes (e.g. dosage, frequency) are not being
>> annotated?  I think the DrugNER needs a list of section names in the
>> config; I think it includes SIMPLE_SEGMENT.  I am very surprised that
>> SimpleSegmentAnnotator is the bottleneck though; all it does is assume
>> the entire document is a single section called SIMPLE_SEGMENT.
>> Have you tried commenting out the DependencyParser if you're not using
>> those features?
>> 
>> --Pei
>> 
>> 
>> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <
>> snika...@emerginghealthit.com> wrote:
>>> 
>>> Hi there,
>>> 
>>> I am using cTAKES to process 5000K free-text records, where each record
>>> has several medications.
>>> This is the fixed flow that it goes through:
>>>
>>> <node>SimpleSegmentAnnotator</node>
>>> <node>SentenceDetectorAnnotator</node>
>>> <node>TokenizerAnnotator</node>
>>> <node>LvgAnnotator</node>
>>> <node>ContextDependentTokenizerAnnotator</node>
>>> <node>POSTagger</node>
>>> <node>Chunker</node>
>>> <node>LookupWindowAnnotator</node>
>>> <node>DictionaryLookupAnnotatorDB</node>
>>> <node>DependencyParser</node>
>>> <node>AssertionAnnotator</node>
>>> <node>ExtractionPrepAnnotator</node>
>>>
>>> But it takes a very long time to process that much data (maybe a week
>>> or so) when I use SimpleSegmentAnnotator.  By eliminating
>>> SimpleSegmentAnnotator the process is very fast, but no medication is being
>>> annotated.  Do you have any suggestions?
>>> 
>>> Thanks,
>>> Nick
>>
