Hi Tomasz,

The change to lowercase is also done in the dictionary code. Unless you want to make a database for the previous dictionary lookup module (it looks like you don't), you shouldn't bother with the old dictionarytool.jar. Use the newer dictionary-gui in sandbox instead. The class there is org.apache.ctakes.dictionary.creator.util.TextTokenizer. In the getTokenizedText(..) method, line 177, just remove the .toLowerCase() call.
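To illustrate what removing the lowercasing changes, here is a toy sketch in plain Java. It is not the cTAKES TextTokenizer or lookup code; the class CaseFoldingDemo, its helpers, and the single-entry term map are hypothetical, and the upper-case spelling "AIDS" for C0001175 is assumed for illustration. It only shows that the index and the looked-up token need the same normalization, and that without case folding the lower-case "aids" no longer hits the entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only; this is NOT the cTAKES TextTokenizer or lookup code.
class CaseFoldingDemo {

    // Builds a term -> CUI index, optionally lowercasing the terms (hypothetical helper).
    static Map<String, String> buildIndex(Map<String, String> terms, boolean lowercase) {
        Map<String, String> index = new HashMap<>();
        for (Map.Entry<String, String> e : terms.entrySet()) {
            String key = lowercase ? e.getKey().toLowerCase(Locale.ROOT) : e.getKey();
            index.put(key, e.getValue());
        }
        return index;
    }

    // Looks up a token from the note, applying the same normalization as the index.
    static String lookup(Map<String, String> index, String token, boolean lowercase) {
        String key = lowercase ? token.toLowerCase(Locale.ROOT) : token;
        return index.get(key); // null when there is no match
    }

    public static void main(String[] args) {
        Map<String, String> terms = new HashMap<>();
        terms.put("AIDS", "C0001175"); // assumed spelling of the dictionary term

        Map<String, String> folded = buildIndex(terms, true);
        Map<String, String> exact = buildIndex(terms, false);

        System.out.println(lookup(folded, "aids", true));  // case-folded: "aids" matches
        System.out.println(lookup(exact, "aids", false));  // case-preserving: no match
        System.out.println(lookup(exact, "AIDS", false));  // case-preserving: "AIDS" matches
    }
}
```

Note that if only one side is changed (a case-preserving dictionary with lowercased tokens, or the reverse), upper-case terms would stop matching altogether.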
In the ctakes -fast module code you will need to replace the ...dictionary.lookup2.util.FastLookupToken class and remove the .toLowerCase() from the constructor method, line 45. You cannot extend that class as it is immutable.

Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:ol...@uchicago.edu]
Sent: Wednesday, June 01, 2016 3:20 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Another idea would be to create the dictionary without lowercasing the concept text and rare word in CUI_TERMS, but keep them as they are from the UMLS.

Do you happen to know which class / line is responsible for the lowercasing in the dictionarytool.jar? I would like to try this.

Regards,
Tomasz

________________________________________
From: Tomasz Oliwa [ol...@uchicago.edu]
Sent: Wednesday, June 01, 2016 11:07 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Thank you all for the suggestions.

Sean, by "make the AE case-sensitive" do you mean writing an annotator that simply removes an annotation based on some criteria like case and semantic type? Or does cTAKES have such a switch already available?

________________________________________
From: Finan, Sean [sean.fi...@childrens.harvard.edu]
Sent: Wednesday, June 01, 2016 10:56 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES false positives, case-insensitivity

Oh - I should mention: Increasing the minimum required span can cause unwanted false negatives. A minimum of 5 will get rid of things like "arm" and "foot". You could make your own AE that changes this by getting rid of only disease/disorder with character count < 5. That would probably be better. Also maybe meds with count < 5. You can even make the AE case-sensitive in case that helps.
Sean

-----Original Message-----
From: Tomasz Oliwa [mailto:ol...@uchicago.edu]
Sent: Wednesday, June 01, 2016 11:28 AM
To: dev@ctakes.apache.org
Subject: cTAKES false positives, case-insensitivity

Hi,

I have encountered false positives annotated with cTAKES that seem to come from case-insensitivity of the annotation lookup, such as:

Pt uses hearing aids.
-> "aids" is found as DiseaseDisorderMention cui=C0001175, Acquired Immunodeficiency Syndrome

Pt values are all stable.
-> "all" is found as DiseaseDisorderMention cui=C1961102, Precursor Cell Lymphoblastic Leukemia Lymphoma

Are there ways in cTAKES to approach or to resolve such issues? How do you deal with such false positives, so that they are not matched?

Regards,
Tomasz
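
The post-filter suggested in this thread (drop only disease/disorder and medication mentions shorter than 5 characters, optionally taking case into account) could be sketched roughly as below. This is plain Java rather than a UIMA annotator: the ShortMentionFilter class, the Mention stand-in, and the string type labels are hypothetical, and the rule "keep a short mention only if it is written entirely in capitals" is one assumed reading of "case-sensitive", not something cTAKES provides as a switch. A real AE would apply the same test to the annotations in the CAS and remove the rejected ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the filtering rule; a real annotator would work on the CAS instead.
class ShortMentionFilter {

    // Hypothetical stand-in for an annotation: covered text plus a semantic-type label.
    static final class Mention {
        final String text;
        final String type;
        Mention(String text, String type) { this.text = text; this.type = type; }
    }

    static final int MIN_LENGTH = 5; // threshold mentioned in the thread

    // True when the text has at least one letter and every letter is upper case.
    static boolean isAllUpperCase(String text) {
        boolean sawLetter = false;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                if (!Character.isUpperCase(c)) return false;
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    // Assumed rule: short disorder/medication mentions survive only if written in capitals.
    static boolean keep(Mention m) {
        boolean filteredType = m.type.equals("DiseaseDisorderMention")
                || m.type.equals("MedicationMention");
        if (!filteredType || m.text.length() >= MIN_LENGTH) return true;
        return isAllUpperCase(m.text);
    }

    static List<Mention> filter(List<Mention> mentions) {
        List<Mention> kept = new ArrayList<>();
        for (Mention m : mentions) {
            if (keep(m)) kept.add(m);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Mention> mentions = Arrays.asList(
                new Mention("aids", "DiseaseDisorderMention"),
                new Mention("all", "DiseaseDisorderMention"),
                new Mention("AIDS", "DiseaseDisorderMention"),
                new Mention("arm", "AnatomicalSiteMention"));
        for (Mention m : filter(mentions)) {
            System.out.println(m.text + " (" + m.type + ")");
        }
    }
}
```

With this rule the lower-case "aids" and "all" from the examples above are dropped, while an upper-case "AIDS" and short anatomical terms such as "arm" are left alone. Sentence-initial words and all-caps notes would still need separate handling.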