Hello,
I’ve run into a question about counting CUIs when running cTAKES. When I 
process a document through cTAKES, the XMI output contains numerous XML tags. 
The tags our lab is interested in are those carrying CUIs. For example, the 
XMI tag

<refsem:UmlsConcept xmi:id="16626" codingScheme="SNOMEDCT_US" code="7092007" 
score="0.0" disambiguated="false" cui="C0025859" tui="T109" 
preferredText="Metoprolol-containing product"/>

would indicate that the CUI C0025859 (Metoprolol-containing product) was found 
in a given document.

In the input document text, I can locate three instances of the drug 
Metoprolol. When I open the cTAKES XMI output in the CVD viewer, each of these 
Metoprolol mentions has an ontologyConceptArr with 4 members, which looks like 
this:

// found at org.apache.ctakes.typesystem.type.textsem.EventMention
//       org.apache.ctakes.typesystem.type.textsem.MedicationMention
//           ontologyConceptArr = uima.cas.FSArray[4]

<refsem:UmlsConcept xmi:id="16626" codingScheme="SNOMEDCT_US" code="7092007" 
score="0.0" disambiguated="false" cui="C0025859" tui="T109" 
preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16646" codingScheme="SNOMEDCT_US" code="7092007" 
score="0.0" disambiguated="false" cui="C0025859" tui="T121" 
preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16616" codingScheme="SNOMEDCT_US" code="372826007" 
score="0.0" disambiguated="false" cui="C0025859" tui="T109" 
preferredText="Metoprolol-containing product"/>
<refsem:UmlsConcept xmi:id="16636" codingScheme="SNOMEDCT_US" code="372826007" 
score="0.0" disambiguated="false" cui="C0025859" tui="T121" 
preferredText="Metoprolol-containing product"/>

Although not shown here, it is possible for there to be different CUIs within a 
single uima.cas.FSArray, with this array mapping to a single string of text in 
the document.

If I walk the XMI file and retrieve all CUIs, the CUI C0025859 is found 12 
times. However, if I extend the JCasAnnotator_ImplBase Java class to extract 
the CUIs from the JCas annotations, it finds this CUI only 3 times.
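
To make the discrepancy concrete, here is a minimal Python sketch of the two 
ways of counting. It is only an illustration, not the code I actually ran. It 
assumes that each mention's ontologyConceptArr is serialized in the XMI as an 
attribute holding space-separated xmi:id references, and that counting each 
CUI once per mention approximates what my annotator does. The function name 
count_cuis and the trimmed SAMPLE document (including its namespace URIs) are 
hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed, hypothetical XMI: one medication mention that references four
# UmlsConcept entries, all carrying the same CUI.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore"
    xmlns:textsem="http:///org/apache/ctakes/typesystem/type/textsem.ecore">
  <textsem:MedicationMention xmi:id="100" begin="0" end="10"
      ontologyConceptArr="16626 16646 16616 16636"/>
  <refsem:UmlsConcept xmi:id="16626" code="7092007" cui="C0025859" tui="T109"/>
  <refsem:UmlsConcept xmi:id="16646" code="7092007" cui="C0025859" tui="T121"/>
  <refsem:UmlsConcept xmi:id="16616" code="372826007" cui="C0025859" tui="T109"/>
  <refsem:UmlsConcept xmi:id="16636" code="372826007" cui="C0025859" tui="T121"/>
</xmi:XMI>"""


def local(name):
    """Strip a '{namespace}' prefix from an element or attribute name."""
    return name.rsplit("}", 1)[-1]


def count_cuis(xmi_text):
    """Return (count per UmlsConcept element, count per mention) as Counters."""
    root = ET.fromstring(xmi_text)

    # Map each UmlsConcept's xmi:id to its CUI.
    concepts = {}
    for el in root.iter():
        if local(el.tag) == "UmlsConcept":
            xmi_id = next(
                (v for k, v in el.attrib.items() if local(k) == "id"), None
            )
            concepts[xmi_id] = el.get("cui")

    # Method 1: every UmlsConcept element counts once.
    per_concept = Counter(concepts.values())

    # Method 2: each distinct CUI counts once per mention that references it.
    # Assumes ontologyConceptArr holds space-separated xmi:id references.
    per_mention = Counter()
    for el in root.iter():
        refs = el.get("ontologyConceptArr")
        if refs:
            per_mention.update({concepts[r] for r in refs.split() if r in concepts})

    return per_concept, per_mention


per_concept, per_mention = count_cuis(SAMPLE)
print(per_concept["C0025859"])  # 4
print(per_mention["C0025859"])  # 1
```

With three Metoprolol mentions of four concepts each, the first count would 
give 12 and the second would give 3, which matches the two numbers I am 
seeing.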

If part of the output needs to include a count of all CUIs found by cTAKES 
within a given document, which method is correct?

Thanks!


John Caskey, PhD
Senior Data Scientist
Department of Medicine
University of Wisconsin-Madison

