Sean- Aren't the scripts to generate the DB already available in the sandbox area?
Sent from my iPhone

> On Sep 9, 2014, at 5:24 PM, "Finan, Sean" <sean.fi...@childrens.harvard.edu> wrote:
>
> There is a tool to generate a dictionary in the new format using the UMLS MR*** files.
>
> The module can also read directly from a file with bar-separated values: CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
>
> I can send a copy of the dictionary creator jar and instructions tomorrow.
>
> Sean
> ________________________________________
> From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
> Sent: Tuesday, September 09, 2014 5:17 PM
> To: dev@ctakes.apache.org
> Subject: Re: Ctakes to process 5000K recoreds
>
> Sean,
>
> If that is a script for generating a dictionary for use with dictionary-lookup-fast, I would also be very interested in checking it out.
>
> Thanks,
>
> Bruce
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <snika...@emerginghealthit.com> wrote:
>
>> Great. I will do that. Thanks again.
>>
>> Nick
>>
>> -----Original Message-----
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:39 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Just use it with cTakes. Instead of removing other modules from the pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>>
>> For the desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml, you would modify:
>>
>> <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>   <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>> </delegateAnalysisEngine>
>>
>> To be:
>>
>> <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>   <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>> </delegateAnalysisEngine>
>>
>> That should be it. You can then leave the rest of the module specifications alone.
>>
>> Sean
>>
>> ________________________________________
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:32 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Sean,
>>
>> Many thanks, I will try it tomorrow. Do you have any special instructions for running that script, or do I have to use it with cTakes?
>>
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:24 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Nick,
>>
>> I think that the bottleneck is probably the lookup module itself. So, I just sent you a secure email/ftp link. It contains a build of the new dictionary-lookup-fast module. Should you choose to try it, let me know how things turn out.
>>
>> Sean
>> ________________________________________
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:10 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Thanks, let me try it.
>> Nick
>>
>> -----Original Message-----
>> From: Masanz, James J.
>> [mailto:masanz.ja...@mayo.edu]
>> Sent: Tuesday, September 09, 2014 4:08 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> If you just need the medication names, you can remove these:
>> <node>ContextDependentTokenizerAnnotator</node>
>> <node>DependencyParser</node>
>> <node>AssertionAnnotator</node>
>>
>> You might be able to get rid of the LvgAnnotator and still get decent results, since variations of word form should not affect medication names. I would try with it and without it on a smaller set of files and see if you see a difference.
>>
>> I believe the others are needed by the default configs for medication lookup. For example, POS is used to get phrase type. Phrases are used to remove verb phrases from the lookup and therefore also to keep the lookup windows from getting too big. I'm more familiar with the other types of named entities (diseases, symptoms, etc.) than with medications.
>>
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 3:01 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> James,
>>
>> Do you have any suggestions for running cTakes with the minimum set of annotators that can still return medications in DictionaryLookupAnnotator?
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>> Sent: Tuesday, September 09, 2014 3:05 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> I suspect that when you take out the simple segment annotator, nothing is getting processed, and that is why it appears so fast. At least some of the annotators loop through the list of sections/segments, which is why there is a simple segment annotator - so that there is at least one section/segment identified. Are you getting any annotations at all?
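
A minimal Python sketch of the trimming James suggests above, assuming the fixed flow is held as a plain list of node names. The list is copied from the flow Nick posted in this thread; `trim_flow` and `render_nodes` are hypothetical helper names, not part of cTAKES. Which nodes are safe to drop is James's suggestion for medication-only lookup, not something the sketch verifies.

```python
# Fixed flow as posted by Nick in this thread.
FLOW = [
    "SimpleSegmentAnnotator",
    "SentenceDetectorAnnotator",
    "TokenizerAnnotator",
    "LvgAnnotator",
    "ContextDependentTokenizerAnnotator",
    "POSTagger",
    "Chunker",
    "LookupWindowAnnotator",
    "DictionaryLookupAnnotatorDB",
    "DependencyParser",
    "AssertionAnnotator",
    "ExtractionPrepAnnotator",
]

# Nodes James suggests removing when only medication names are needed.
REMOVABLE = {
    "ContextDependentTokenizerAnnotator",
    "DependencyParser",
    "AssertionAnnotator",
}


def trim_flow(nodes, removable):
    """Return the flow without the removable nodes, keeping the original order."""
    return [name for name in nodes if name not in removable]


def render_nodes(nodes):
    """Render node names as <node> lines for pasting into a fixedFlow block."""
    return "\n".join(f"<node>{name}</node>" for name in nodes)


if __name__ == "__main__":
    print(render_nodes(trim_flow(FLOW, REMOVABLE)))
```

LvgAnnotator is left in on purpose: James recommends comparing results with and without it on a small set of files before dropping it.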
>>
>> -----Original Message-----
>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 2:02 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Pei,
>> I need the names of the medications for the application that I wrote, which uses cTakes, so I cache the medications in DictionaryLookupAnnotator (in performLookup()) and use them in my program. But when I have SimpleSegmentAnnotator in the pipeline, it just takes forever. After taking SimpleSegmentAnnotator out, no medication name is returned by DictionaryLookupAnnotator in the code. So I was wondering if there is a way to eliminate SimpleSegmentAnnotator but still get the medication names in that class?
>>
>> Nick
>>
>> -----Original Message-----
>> From: Pei Chen [mailto:chen...@apache.org]
>> Sent: Tuesday, September 09, 2014 2:54 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Ctakes to process 5000K recoreds
>>
>> Nick,
>> When you say no medication is being annotated, I presume you mean the medication attributes (i.e. dosage, frequency, etc.) are not being annotated? I think the DrugNER needs a list of section names in the config; I think it includes SIMPLE_SEGMENT. I am very surprised that SimpleSegmentAnnotator is the bottleneck though; all it does is assume the entire document is a single section called SIMPLE_SEGMENT.
>> Have you tried commenting out the DependencyParser if you're not using those features?
>>
>> --Pei
>>
>>
>> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <snika...@emerginghealthit.com> wrote:
>>>
>>> Hi there,
>>>
>>> I am using cTakes to process 5000K free-text records, where each record has several medications.
>>> This is the fixed flow that it goes through:
>>> <node>SimpleSegmentAnnotator</node>
>>> <node>SentenceDetectorAnnotator</node>
>>> <node>TokenizerAnnotator</node>
>>> <node>LvgAnnotator</node>
>>> <node>ContextDependentTokenizerAnnotator</node>
>>> <node>POSTagger</node>
>>> <node>Chunker</node>
>>> <node>LookupWindowAnnotator</node>
>>> <node>DictionaryLookupAnnotatorDB</node>
>>> <node>DependencyParser</node>
>>> <node>AssertionAnnotator</node>
>>> <node>ExtractionPrepAnnotator</node>
>>>
>>> But it takes a very long time to process that much data (maybe a week or so) when I use SimpleSegmentAnnotator. By eliminating SimpleSegmentAnnotator the process is very fast, but no medication is being annotated. Do you have any suggestions?
>>>
>>> Thanks,
>>> Nick
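
The descriptor change Sean describes in this thread (pointing the DictionaryLookupAnnotatorDB delegate at the dictionary-lookup-fast descriptor) could be scripted roughly as follows. This is a sketch only: `swap_delegate_import` is a hypothetical helper, it assumes the aggregate descriptor uses the standard UIMA resourceSpecifier namespace, and parsing with ElementTree drops XML comments, so editing by hand may be preferable for a one-off change.

```python
import xml.etree.ElementTree as ET

# Assumption: the descriptor declares the standard UIMA descriptor namespace.
UIMA_NS = "http://uima.apache.org/resourceSpecifier"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def swap_delegate_import(xml_text, key, new_location):
    """Point the <import> of the delegate with the given key at new_location.

    Returns the rewritten XML text and the number of imports changed.
    """
    # Serialize the UIMA namespace as the default namespace, not as ns0:.
    ET.register_namespace("", UIMA_NS)
    root = ET.fromstring(xml_text)
    changed = 0
    for elem in root.iter():
        if local_name(elem.tag) != "delegateAnalysisEngine":
            continue
        if elem.get("key") != key:
            continue
        for child in elem:
            if local_name(child.tag) == "import":
                child.set("location", new_location)
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    # Paths are the ones quoted by Sean; the wrapper element is illustrative.
    sample = (
        '<delegateAnalysisEngineSpecifiers>'
        '<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">'
        '<import location="../../../ctakes-dictionary-lookup/desc/'
        'analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>'
        '</delegateAnalysisEngine>'
        '</delegateAnalysisEngineSpecifiers>'
    )
    new_xml, count = swap_delegate_import(
        sample,
        "DictionaryLookupAnnotatorDB",
        "../../../ctakes-dictionary-lookup-fast/desc/"
        "analysis_engine/UmlsLookupAnnotator.xml",
    )
    print(count)
    print(new_xml)
```

Only the import location changes; the delegate key stays DictionaryLookupAnnotatorDB, which is why the rest of the flow can be left alone, as Sean notes.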