Sean,

If that is a script for generating a dictionary for use with dictionary-lookup-fast, I would also be very interested in checking it out.

Thanks,
Bruce

Bruce Tietjen
Senior Software Engineer
IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <snika...@emerginghealthit.com> wrote:

> Great. I will do that. Thanks again.
>
> Nick
>
> -----Original Message-----
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Tuesday, September 09, 2014 4:39 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Just use it with cTakes. Instead of removing other modules from the
> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>
> For
> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml,
> you would modify:
>
> <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>   <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
> </delegateAnalysisEngine>
>
> To be:
>
> <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>   <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
> </delegateAnalysisEngine>
>
> That should be it. You can then leave the rest of the module
> specifications alone.
>
> Sean
>
> ________________________________________
> From: Nick Nikandish [snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 4:32 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Sean,
>
> Many thanks, I will try it tomorrow. Do you have any special instructions
> for running that script, or do I have to use it with cTakes?
>
> Thanks,
> Nick
>
> -----Original Message-----
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Tuesday, September 09, 2014 4:24 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Nick,
>
> I think that the bottleneck is probably the lookup module itself.
> So, I just sent you a secure email/ftp link. It contains a build of the
> new dictionary-lookup-fast module. Should you choose to try it, let me
> know how things turn out.
>
> Sean
> ________________________________________
> From: Nick Nikandish [snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 4:10 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Thanks, let me try it.
> Nick
>
> -----Original Message-----
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Tuesday, September 09, 2014 4:08 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Ctakes to process 5000K recoreds
>
> If you just need the medication names, you can remove these:
>
>   <node>ContextDependentTokenizerAnnotator</node>
>   <node>DependencyParser</node>
>   <node>AssertionAnnotator</node>
>
> You might be able to get rid of the LvgAnnotator and still get decent
> results, since variations of word form should not affect medication
> names. I would try with it and without it on a smaller set of files and
> see if you see a difference.
>
> I believe the others are needed by the default configs for medication
> lookup. For example, POS is used to get phrase type. Phrases are used to
> remove verb phrases from the lookup, and therefore also to keep the
> lookup windows from getting too big. I'm more familiar with the other
> types of named entities (diseases, symptoms, etc.) than with medications.
>
> -----Original Message-----
> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 3:01 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> James,
>
> Do you have any suggestions about running cTakes with the minimum set of
> annotators that can still return medications in DictionaryLookupAnnotator?
>
> Thanks,
> Nick
>
> -----Original Message-----
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Tuesday, September 09, 2014 3:05 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Ctakes to process 5000K recoreds
>
> I suspect that when you take out the simple segment annotator, nothing is
> getting processed, and that is why it appears so fast. At least some of
> the annotators loop through the list of sections/segments, which is why
> there is a simple segment annotator: so that there is at least one
> section/segment identified. Are you getting any annotations at all?
>
> -----Original Message-----
> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 2:02 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Pei,
>
> I need the names of the medications for the application that I wrote,
> which uses ctakes. So I cache the medications in
> DictionaryLookupAnnotator (in performLookup()) and use them in my
> program, but when I have SimpleSegmentAnnotator it just takes forever.
> After taking SimpleSegmentAnnotator out, no medication name is returned
> in DictionaryLookupAnnotator in the code. So I was wondering if there is
> a way that I could eliminate SimpleSegmentAnnotator but still be able to
> get the medication names in that class?
>
> Nick
>
> -----Original Message-----
> From: Pei Chen [mailto:chen...@apache.org]
> Sent: Tuesday, September 09, 2014 2:54 PM
> To: dev@ctakes.apache.org
> Subject: Re: Ctakes to process 5000K recoreds
>
> Nick,
>
> When you say no medication is being annotated, I presume you mean the
> medication attributes (i.e. dosage, frequency, etc.) are not being
> annotated? I think the DrugNER needs a list of section names in the
> config; I think it includes SIMPLE_SEGMENT. I am very surprised that
> SimpleSegmentAnnotator is the bottleneck though; all it does is assume
> the entire document is a single section called SIMPLE_SEGMENT.
> Have you tried commenting out the DependencyParser if you're not using
> those features?
>
> --Pei
>
> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish
> <snika...@emerginghealthit.com> wrote:
> >
> > Hi there,
> >
> > I am using Ctakes to process 5000K free-text records, where each
> > record has several medications.
> > This is the fixed flow that it goes through:
> >
> >   <node>SimpleSegmentAnnotator</node>
> >   <node>SentenceDetectorAnnotator</node>
> >   <node>TokenizerAnnotator</node>
> >   <node>LvgAnnotator</node>
> >   <node>ContextDependentTokenizerAnnotator</node>
> >   <node>POSTagger</node>
> >   <node>Chunker</node>
> >   <node>LookupWindowAnnotator</node>
> >   <node>DictionaryLookupAnnotatorDB</node>
> >   <node>DependencyParser</node>
> >   <node>AssertionAnnotator</node>
> >   <node>ExtractionPrepAnnotator</node>
> >
> > But it takes a very, very long time to process that much data (maybe
> > a week or so) when I use SimpleSegmentAnnotator. By eliminating
> > SimpleSegmentAnnotator the process is very fast, but no medication is
> > being annotated. Do you guys have any suggestions?
> >
> > Thanks,
> > Nick
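
For reference, combining James's suggestion with the flow Nick posted, a trimmed fixed flow might look like the sketch below. It is only Nick's original node list with the three nodes James named removed. It assumes medication names are all that is needed, and it has not been run against a real pipeline. LvgAnnotator is left in because James suggested comparing results with and without it on a smaller set of files before dropping it.

```xml
<!-- Sketch only: the original flow minus ContextDependentTokenizerAnnotator,
     DependencyParser and AssertionAnnotator, per James's suggestion. -->
<node>SimpleSegmentAnnotator</node>
<node>SentenceDetectorAnnotator</node>
<node>TokenizerAnnotator</node>
<!-- Possibly removable too; compare output with and without it first. -->
<node>LvgAnnotator</node>
<node>POSTagger</node>
<node>Chunker</node>
<node>LookupWindowAnnotator</node>
<node>DictionaryLookupAnnotatorDB</node>
<node>ExtractionPrepAnnotator</node>
```

SimpleSegmentAnnotator stays in the list because, as James and Pei noted, downstream annotators loop over sections/segments and need at least one to exist. Sean's dictionary-lookup-fast swap is independent of this trimming, since it keeps the same DictionaryLookupAnnotatorDB key.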