Re: cTAKES questions [EXTERNAL]

2023-04-12 Thread Finan, Sean
Hi John,

Good questions.  Unfortunately, I can't really say what is going on, as it seems 
that a lot of the information is in your images - 1000 words and all that.
Attachments and inserted images do not go through the dev@ email system.  
Please copy/paste some plain text into this thread and we will try to help you.

The first "NOCODE" item might come from a table name mismatch in the database, 
e.g. "ICD-9" vs. "ICD_9", but that is a shot in the dark.

The second issue that you report is more concerning.  You are correct in that 
it is unexpected and most likely not a great thing to have happening.

Just in case it makes things easier, you can use another method for getting 
CUIs.  For instance, add the SemanticTableFileWriter to the end of your 
pipeline.  It writes one file per note and accepts the standard file writer 
parameter "SubDirectory", plus a "TableType" parameter that takes one of the 
values BSV, CSV, HTML, or TAB.

Sean


From: JOHN R CASKEY 
Sent: Tuesday, April 11, 2023 11:45 PM
To: dev@ctakes.apache.org 
Subject: cTAKES questions [EXTERNAL]



Hello,

I have a minor bug to report, and a question that may be a part of a major bug.



If I create a custom dictionary with multiple vocabularies and then run cTAKES 
using this custom dictionary, cTAKES will sometimes replace the vocabulary name 
with the name of the custom dictionary. An example from a run on the MIMIC 
dataset is shown in the attached image1.png. One CUI that had the incorrect 
vocabulary name inserted was C1548802; when I looked it up in the UMLS 
Metathesaurus Browser, it had ‘NOCODE’ for the code. This only seemed to 
occur with CUIs from the MTH vocabulary. Is this something that can be fixed 
within cTAKES?



The question, and possibly a major bug: we ran the same dataset (50 MIMIC 
notes) twice, once with the multiple-vocabulary custom dictionary described 
in the attached image1.png, and once with a custom dictionary that only 
included the snomed vocabulary. Next, we filtered the output from the 
multiple-vocabulary dictionary to only include CUIs that were reported by 
snomed. The two outputs from cTAKES should have contained the same CUIs, but 
as can be seen in the attached Venn diagrams, some of the CUIs reported by 
cTAKES running the snomed-only dictionary were not reported by cTAKES running 
the multiple-vocabulary dictionary. Do you know why the two outputs would be 
different?



We’re running a user installation of cTAKES 4.0.0.1 via



./bin/runPiperFile.sh -p path/to/piperfile -l path/to/custom_dict.xml -i 
inputDir --xmiOut outputDir



And then extracting the CUIs from the output XMI files.
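
For anyone who wants to reproduce the comparison as plain text, here is a minimal Python sketch of the extraction and set difference. It is not our exact script. It assumes that concepts in the XMI are elements carrying a `cui` attribute and a `codingScheme` attribute, and that the snomed vocabulary is labeled `SNOMEDCT_US`; the helper names, the namespace URI, and the two miniature documents are made up for illustration, so adjust them to match your own output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def cuis_by_scheme(xmi_text):
    """Group CUIs by coding scheme for every element that has a 'cui' attribute."""
    found = defaultdict(set)
    for elem in ET.fromstring(xmi_text).iter():
        cui = elem.get("cui")
        if cui:
            # Assumption: the vocabulary name is stored in 'codingScheme'.
            found[elem.get("codingScheme", "UNKNOWN")].add(cui)
    return dict(found)


def missing_cuis(snomed_only_xmi, multi_vocab_xmi, scheme="SNOMEDCT_US"):
    """CUIs found with the single-vocabulary dictionary but absent from the
    scheme-filtered output of the multi-vocabulary dictionary."""
    expected = cuis_by_scheme(snomed_only_xmi).get(scheme, set())
    actual = cuis_by_scheme(multi_vocab_xmi).get(scheme, set())
    return sorted(expected - actual)


# Made-up miniature documents with placeholder CUIs, not real cTAKES output.
HEADER = ('<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
          'xmlns:refsem="http://example.org/refsem">')

SNOMED_ONLY = HEADER + """
  <refsem:UmlsConcept codingScheme="SNOMEDCT_US" cui="C0000001"/>
  <refsem:UmlsConcept codingScheme="SNOMEDCT_US" cui="C0000002"/>
</xmi:XMI>"""

# The second concept carries a dictionary name instead of a vocabulary name,
# mimicking the first symptom described above.
MULTI_VOCAB = HEADER + """
  <refsem:UmlsConcept codingScheme="SNOMEDCT_US" cui="C0000001"/>
  <refsem:UmlsConcept codingScheme="custom_dict" cui="C0000002"/>
</xmi:XMI>"""

if __name__ == "__main__":
    print(missing_cuis(SNOMED_ONLY, MULTI_VOCAB))
```

Printing the list returned by `missing_cuis` for a pair of real output files gives a plain-text version of the Venn diagram difference that can be pasted into an email.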



Please let me know if I should report this as an issue on the new GitHub 
repository instead of via email.



Thanks!



John Caskey




cTAKES running slower with each run

2023-04-12 Thread Milinovich, Alex
I'm running cTAKES both as an API and as a console app.  Each time I hit 
cTAKES, the run time per document gets incrementally slower than for the 
previous document, by a few thousandths of a millisecond per XML element (to 
normalize for different document sizes).  Compounded over 1000 documents in 20 
minutes, the runs go from 0.06 milliseconds per XML element to 1.5 
milliseconds per XML element.  It's a very consistent 0.002 millisecond 
increase in the rate for each subsequent document I throw at cTAKES.

Is there any caching or garbage collection or something I should be on the 
lookout to adjust or fix?
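
To make the trend concrete, here is a minimal Python sketch of how the per-document rate can be checked for a constant increment. It is an illustration only: the function names are hypothetical, and the data is synthetic, generated from the 0.06 ms starting rate and 0.002 ms step quoted above rather than taken from real measurements. With those two figures, a rate of 1.5 ms is reached after (1.5 - 0.06) / 0.002 = 720 documents.

```python
def ms_per_element(elapsed_ms, element_count):
    """Normalize one document's run time by its number of XML elements."""
    return elapsed_ms / element_count


def fit_line(rates):
    """Ordinary least-squares fit of rate against document index (0, 1, 2, ...).

    Returns (slope, intercept): the per-document increase and the starting rate.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two documents to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic rates built from the quoted figures, not measured values.
synthetic_rates = [0.06 + 0.002 * i for i in range(721)]

if __name__ == "__main__":
    slope, intercept = fit_line(synthetic_rates)
    print(f"increase per document: {slope:.4f} ms per element")
    print(f"starting rate:         {intercept:.4f} ms per element")
    print(f"rate at last document: {synthetic_rates[-1]:.4f} ms per element")
```

A slope that stays constant across runs points at something accumulating with every document, as opposed to occasional pauses, which would show up as isolated spikes around a flat line.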

Thanks

~Alex


Alex Milinovich
Director of Research - Data Science Analytics  |  Quantitative Health Sciences
9500 Euclid Ave. - JJN3 | Cleveland, OH 44195 | m: (216) 245-7655


Please consider the environment before printing this e-mail

Cleveland Clinic is currently ranked as the No. 2 hospital in the country by 
U.S. News & World Report (2017-2018). Visit us online at 
http://www.clevelandclinic.org for a complete listing of our services, staff 
and locations. Confidentiality Note: This message is intended for use only by 
the individual or entity to which it is addressed and may contain information 
that is privileged, confidential, and exempt from disclosure under applicable 
law. If the reader of this message is not the intended recipient or the 
employee or agent responsible for delivering the message to the intended 
recipient, you are hereby notified that any dissemination, distribution or 
copying of this communication is strictly prohibited. If you have received this 
communication in error, please contact the sender immediately and destroy the 
material in its entirety, whether electronic or hard copy. Thank you.


Re: cTAKES running slower with each run

2023-04-12 Thread Peter Abramowitsch
There are many ways to package cTAKES, and admittedly ours is unlike the
console app and we have our own multithreaded API, but we regularly process
millions of documents at a time and haven't seen this issue.  The core
application with our fairly standard pipeline is up for a month at a time
with no degradation.

Are you using any unusual or deprecated annotators?  It could be that one
of the less-used ones doesn't properly separate initialization from its
processing method and is caching something that it shouldn't.  Are you
seeing a concomitant growth in memory footprint?

Try running it under jvisualvm; it may give you a clue.

Peter

On Wed, Apr 12, 2023 at 7:41 PM Milinovich, Alex  wrote:

> I’m running cTAKES both as an API and as a console app.  Each time I hit
> cTAKES, the run time per document gets incrementally slower than for the
> previous document, by a few thousandths of a millisecond per XML element
> (to normalize for different document sizes).  Compounded over 1000
> documents in 20 minutes, the runs go from 0.06 milliseconds per XML
> element to 1.5 milliseconds per XML element.  It’s a very consistent
> 0.002 millisecond increase in the rate for each subsequent document I throw
> at cTAKES.
>
>
>
> Is there any caching or garbage collection or something I should be on the
> lookout to adjust or fix?
>
>
>
> Thanks
>
>
>
> ~Alex