Hi Tomasz,

cTAKES dictionary lookup (both the original and the fast version) is case-insensitive by design. There have been brief discussions about changing this behavior, but things like capitalized form entries, list headings, and plain old first-word capitalization have kept a case-sensitive lookup from being implemented.
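To make that trade-off concrete, here is a minimal plain-Java sketch. It does not use any cTAKES API: the class name, the two-entry dictionary, and the lookup method are all made up for illustration, and real lookup works on spans and far larger dictionaries.

```java
import java.util.ArrayList;
import java.util.List;

public class CaseLookupSketch {

    // Hypothetical toy dictionary, not taken from any cTAKES resource.
    static final List<String> DICTIONARY = List.of("AIDS", "fever");

    // Returns the dictionary entries matched by single tokens of the text.
    static List<String> lookup(String text, boolean ignoreCase) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            for (String entry : DICTIONARY) {
                boolean match = ignoreCase
                        ? entry.equalsIgnoreCase(token)
                        : entry.equals(token);
                if (match) {
                    hits.add(entry);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Case-insensitive: catches "Fever" at sentence start, but also "aids".
        System.out.println(lookup("Pt uses hearing aids.", true));     // [AIDS]
        System.out.println(lookup("Fever resolved overnight.", true)); // [fever]
        // Case-sensitive: drops the false positive, but misses "Fever".
        System.out.println(lookup("Pt uses hearing aids.", false));     // []
        System.out.println(lookup("Fever resolved overnight.", false)); // []
    }
}
```

The exact-case variant removes the false positive only at the cost of missing a term that happens to start a sentence, which is the kind of loss that has blocked the change so far.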
One big interest in the community is word sense disambiguation, which would allow terms to be culled based upon the likelihood that they do not properly fit in context. Culling could also be done based upon the normal frequency of the term appearing in text. Or you could create an annotation engine that culls based upon some other requirement, such as semantic type.

For your two specific examples you can prevent a lot of false positive acronyms and abbreviations by increasing the required character count cutoff for terms. This can be done by setting the UIMA parameter "minimumSpan" to 5 (getting rid of "ALL" and "AIDS" but keeping "APSGN"). You can do this using the old XML style or uimaFIT, something like

    AnalysisEngineFactory.createEngineDescription( DefaultJCasTermAnnotator.class,
          JCasTermAnnotator.PARAM_MIN_SPAN_KEY, 5 )

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:ol...@uchicago.edu]
Sent: Wednesday, June 01, 2016 11:28 AM
To: dev@ctakes.apache.org
Subject: cTAKES false positives, case-insensitivity

Hi,

I have encountered false positives annotated with cTAKES that seem to come from case-insensitivity of the annotation lookup, such as:

Pt uses hearing aids.
-> "aids" is found as DiseaseDisorderMention cui=C0001175, Acquired Immunodeficiency Syndrome

Pt values are all stable.
-> "all" is found as DiseaseDisorderMention cui=C1961102, Precursor Cell Lymphoblastic Leukemia Lymphoma

Are there ways in cTAKES to approach or to resolve such issues? How do you deal with such false positives, so that they are not matched?

Regards,
Tomasz