I can second Sean's thank you; it is good to have this feedback. The ClearTK 
machine learning models were made the default after we ran some experiments 
that found they performed better across a range of standard datasets than 
rule-based algorithms or the existing cTAKES module 
(http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774). 
Since making them the default, though, we have heard from people whose 
experience conflicts with those experiments, and our own experience has 
conflicted with them as well. And certainly the errors in the rule-based 
system are easier to understand.

Just curious, are you able to characterize the errors you see from the ClearTK 
system? I recently ran some experiments on a new dataset comparing negex with 
the ClearTK negation module and found a precision/recall tradeoff between the 
two but almost identical F1 scores. For that dataset, though, our collaborators 
preferred the tradeoff negex provided. (I think negex had better recall of 
negated terms but worse precision.)
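
To make the tradeoff concrete, here is a minimal Python sketch of how two systems can land on the same F1 with different precision and recall. The counts are invented for illustration and are not results from the experiments mentioned above; the function name and the "high_recall"/"balanced" labels are hypothetical as well.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for the negated class.

    tp: negated terms correctly flagged as negated
    fp: non-negated terms wrongly flagged as negated
    fn: negated terms the system missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 written directly in terms of counts: 2*tp / (2*tp + fp + fn)
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return precision, recall, f1


# Made-up counts on a dataset with 100 truly negated terms (tp + fn = 100).
high_recall = precision_recall_f1(tp=90, fp=35, fn=10)  # finds more, flags more
balanced = precision_recall_f1(tp=80, fp=20, fn=20)     # misses more, cleaner

for name, (p, r, f1) in [("high_recall", high_recall), ("balanced", balanced)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Both rows report the same F1, so the score alone cannot say which system is preferable; that depends on whether missed negations or spurious negations cost more for the task at hand.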

Tim



________________________________________
From: Finan, Sean <sean.fi...@childrens.harvard.edu>
Sent: Wednesday, October 19, 2016 10:53 AM
To: dev@ctakes.apache.org
Subject: RE: Best combination of analysis engines to consider negation, family 
history, uncertainty, etc.

Hi Yiming,



Thank you very much for letting the community know what has and has not worked 
for you.  I have also had better results with the Assertion annotators than the 
ClearTk alternatives, but that could be because of the note types/formats that 
I am using.



Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit) is used 
to train machine learning models for detection of the indicated property.  You 
can find information on ClearTk starting here:  
https://urldefense.proofpoint.com/v2/url?u=http-3A__clear.colorado.edu_compsem_&d=DQIGaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=Heup-IbsIg9Q1TPOylpP9FE4GTK-OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h&m=aRk0CH-2UrNpH0F4PgdnzixY-xVsh8OYTCP8mhe27Gw&s=0mEmiKK5adFN2YCkYyNCNM3Cv4FNWlMbN8XU6GtcQP4&e=



If you prefer to read a paper, you can check out 
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.lrec-2Dconf.org_proceedings_lrec2014_pdf_218-5FPaper.pdf&d=DQIGaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=Heup-IbsIg9Q1TPOylpP9FE4GTK-OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h&m=aRk0CH-2UrNpH0F4PgdnzixY-xVsh8OYTCP8mhe27Gw&s=T-pZCKB6BckhHzvYc9gyutCmKQlhitdO_-i4e387tjM&e=



Others on the dev list can provide much more information than I can, so you 
could post a question if you like.



Cheers,

Sean



-----Original Message-----

From: Zuo Yiming [mailto:yiming...@gmail.com]

Sent: Wednesday, October 19, 2016 10:04 AM

To: u...@ctakes.apache.org; dev@ctakes.apache.org

Subject: Best combination of analysis engines to consider negation, family 
history, uncertainty, etc.



Hi everyone,



I've spent the last few months working on a clinical NLP project using 
cTAKES. It's a very complex system to me, and every time I dig into it I make 
new discoveries. Since last week, I have been trying to figure out which 
analysis engines do a good job of handling cases like negation, family 
history, uncertainty, etc. I now have some experience that I would like to 
share with the community.



The best combination for me is to use assertionMiniPipelineAnalysisEngine 
for negation, uncertainty, generic and subject detection, and 
HistoryCleartkAnalysisEngine for history detection. Both engines are in the 
desc/ctakes-assertion folder. The assertionMiniPipelineAnalysisEngine also 
claims to be useful for conditional detection, which I haven't verified using 
my test files yet.



I'm using the AggregatePlaintextFastUMLSProcessor as the top-level pipeline. The 
default analysis engines in AggregatePlaintextFastUMLSProcessor for negation, 
uncertainty, generic, etc. are StatusAnnotator + NegationAnnotator + 
PolarityCleartkAnalysisEngine + SubjectCleartkAnalysisEngine + 
UncertaintyCleartkAnalysisEngine + GenericCleartkAnalysisEngine + 
HistoryCleartkAnalysisEngine. It looks like in the node part, StatusAnnotator 
and NegationAnnotator are commented out, so only the remaining five analysis 
engines are actually used and all of them are in the same desc/ctakes-assertion 
folder. These five analysis engines were not effective in my test files and I'm 
still confused by their relationship to the assertionaAnalysisEngine, 
conceptConverterAnalysisEngine, GenericAttributeAnalysisEngine and 
SubjectAttributeAnalysisEngine used in assertionMiniPipelineAnalysisEngine.

It looks to me like the "Clear" in their names indicates something, but I 
couldn't figure it out without going through the Java code, which I don't 
intend to do at this level.



That's pretty much all of it for now. Anyone familiar with this topic is 
welcome to jump in with insights or corrections. Hopefully, we can have 
a nice discussion that will be useful to other users and developers.



P.S. The reason for using AggregatePlaintextFastUMLSProcessor rather than 
AggregatePlaintextProcessor is that I find the preferred words property in the 
former very useful, while the latter does not provide it.



Best,

Yiming

--

Yiming Zuo 
<https://urldefense.proofpoint.com/v2/url?u=https-3A__sites.google.com_site_yimingzuo_&d=DQIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=4at7fOO27JCueBfJFn7Hv2vKWlUAK-nuYYdmMyGRJPQ&s=vSmSOvLXuCa-Pwp8qu05VTzZgGA0P3Y2CL8q3JBhppQ&e=>
 Georgetown U. Medical Center:

Dr. Ressom's Omics Lab 
<https://urldefense.proofpoint.com/v2/url?u=http-3A__omics.georgetown.edu_&d=DQIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=4at7fOO27JCueBfJFn7Hv2vKWlUAK-nuYYdmMyGRJPQ&s=yNsVaS7s20e-125SmdmQqKHvQ0lAQ7si98GefPRDxT0&e=>
 ECE Department of Virginia Tech:

Computational Bioinformatics & Bio-imaging Laboratory 
<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.cbil.ece.vt.edu_&d=DQIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=4at7fOO27JCueBfJFn7Hv2vKWlUAK-nuYYdmMyGRJPQ&s=DpORI1TH9yITkdlRX_RLjxejH2jMJUq8yFaTPjWAar4&e=>
