Hello, I’ve run into a problem, and have a question, when running cTAKES. When I process a document through cTAKES, the XMI output contains numerous XML tags. The tags our lab is interested in are the ones carrying CUIs. For example, the XMI tag

<refsem:UmlsConcept xmi:id="16626" codingScheme="SNOMEDCT_US" code="7092007" score="0.0" disambiguated="false" cui="C0025859" tui="T109" preferredText="Metoprolol-containing product"/>

would indicate that the CUI C0025859, for Metoprolol-containing product, is found in a given document. If I look at the input document text, I can locate three instances of the drug Metoprolol.

When I look at the cTAKES XMI output in the cTAKES XMI CVD viewer, each of the results for Metoprolol is part of an ontologyConceptArr with 4 members, looking like this:

// found at org.apache.ctakes.typesystem.type.textsem.EventMention
// org.apache.ctakes.typesystem.type.textsem.MedicationMention
// ontologyConceptArr = uima.cas.FSArray[4]
<refsem:UmlsConcept xmi:id="16626" codingScheme="SNOMEDCT_US" code="7092007" score="0.0" disambiguated="false" cui="C0025859" tui="T109" preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16646" codingScheme="SNOMEDCT_US" code="7092007" score="0.0" disambiguated="false" cui="C0025859" tui="T121" preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16616" codingScheme="SNOMEDCT_US" code="372826007" score="0.0" disambiguated="false" cui="C0025859" tui="T109" preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16636" codingScheme="SNOMEDCT_US" code="372826007" score="0.0" disambiguated="false" cui="C0025859" tui="T121" preferredText="Metoprolol-containing product"/>

Although not shown here, a single uima.cas.FSArray can contain different CUIs, with the array mapping to a single string of text in the document.

If I walk the XMI file and retrieve all CUIs, the CUI C0025859 is found 12 times (3 mentions x 4 UmlsConcept entries each). However, if I extend the JCasAnnotator_ImplBase Java class to extract the CUIs from the jCas annotations, it finds this CUI only 3 times.

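To make the two counts concrete, here is a minimal Python sketch that reproduces both from an XMI string. It is an illustration under stated assumptions, not cTAKES code: it assumes each mention's ontologyConceptArr is serialized as an attribute holding space-separated xmi:id references, and the namespace URIs, the mention's xmi:id="100", and its begin/end offsets in the sample are hypothetical. The function name count_cuis is also made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard XMI namespace, used to read the xmi:id attribute.
XMI_ID = "{http://www.omg.org/XMI}id"

# Sample built from the four UmlsConcept entries quoted above.
# Assumptions: the refsem/textsem namespace URIs, the mention's
# xmi:id="100" and its begin/end offsets are illustrative only.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore"
    xmlns:textsem="http:///org/apache/ctakes/typesystem/type/textsem.ecore">
  <textsem:MedicationMention xmi:id="100" begin="0" end="10"
      ontologyConceptArr="16626 16646 16616 16636"/>
  <refsem:UmlsConcept xmi:id="16626" code="7092007" cui="C0025859" tui="T109"/>
  <refsem:UmlsConcept xmi:id="16646" code="7092007" cui="C0025859" tui="T121"/>
  <refsem:UmlsConcept xmi:id="16616" code="372826007" cui="C0025859" tui="T109"/>
  <refsem:UmlsConcept xmi:id="16636" code="372826007" cui="C0025859" tui="T121"/>
</xmi:XMI>"""


def count_cuis(xmi_text):
    """Return (per_concept, per_mention) CUI counts for one XMI document."""
    root = ET.fromstring(xmi_text)

    # Map every UmlsConcept xmi:id to its CUI (matched on local tag name,
    # so the exact namespace URI does not matter).
    concepts = {}
    for el in root.iter():
        if el.tag.endswith("}UmlsConcept"):
            concepts[el.get(XMI_ID)] = el.get("cui")

    # Count 1: one hit per UmlsConcept element (walking the XMI).
    per_concept = Counter(concepts.values())

    # Count 2: one hit per mention for each distinct CUI it references.
    per_mention = Counter()
    for el in root.iter():
        refs = el.get("ontologyConceptArr")
        if refs:
            per_mention.update({concepts[r] for r in refs.split() if r in concepts})

    return per_concept, per_mention


per_concept, per_mention = count_cuis(SAMPLE)
print(per_concept["C0025859"], per_mention["C0025859"])  # prints "4 1"
```

With three such mentions in a document, the first count gives 12 and the second gives 3, matching the two numbers described above.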
If part of the output needs to include a count of all CUIs found by cTAKES within a given document, which method is correct?

Thanks!

John Caskey, PhD
Senior Data Scientist
Department of Medicine
University of Wisconsin-Madison