Re: Configure Fast Lookup Dictionary To Return Only 1 UMLS Code (CUI) [EXTERNAL]

Ryan Young Mon, 23 Mar 2020 12:29:18 -0700

Hello Sean,

Wow! This was a lot more than I was anticipating! Thank you very much!


To answer your questions...
• I am using Windows 10
• I have the Python script call a shell command to run a batch file. The
batch file just contains the following line:
"C:\cTAKES_4.0.0\bin\runPiperFile.bat"  -p "C:\path\to\piper.piper"
• The Python script waits for the shell command to complete (i.e., when
cTAKES is finished processing)
• The Python script will then parse the output text files and then continue
on with the code
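
The steps in the list above could be sketched roughly as follows. This is a minimal illustration rather than the actual script: the two paths are copied from the batch file line, while the function names and the injectable `runner` parameter are hypothetical, added only so the sketch can be exercised without cTAKES installed.

```python
import subprocess

# Paths copied from the batch file line above; adjust to the local install.
RUN_PIPER_BAT = r"C:\cTAKES_4.0.0\bin\runPiperFile.bat"
PIPER_FILE = r"C:\path\to\piper.piper"


def build_ctakes_command(bat_path=RUN_PIPER_BAT, piper_path=PIPER_FILE):
    """Return the argument list equivalent to the one-line batch file."""
    return [bat_path, "-p", piper_path]


def run_ctakes(command=None, runner=subprocess.run):
    """Run cTAKES and block until it exits.

    `runner` is a hypothetical hook (default: subprocess.run) so that the
    function can be tested without actually launching cTAKES.
    """
    if command is None:
        command = build_ctakes_command()
    # subprocess.run waits for the process to finish. check=True raises
    # CalledProcessError on a non-zero exit code, so the script does not
    # go on to parse stale or missing output files.
    return runner(command, check=True)
```

Calling `run_ctakes()` and only then parsing the output folder keeps the "wait, then parse" ordering described above.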

Prior to calling cTAKES, the surgery list is in a Pandas dataframe. My
workaround was to save each row of the dataframe's surgery list column to
its own text file, which made the cTAKES output text files easier to
parse. As I mentioned previously, I would like to have just 1 input text
file and 1 output text file (as long as the output file can be easily
parsed by Python).
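
For the single-input-file goal, a minimal sketch of writing the whole surgery list to one text file, one surgery per line, might look like this. The function name and the example file name are hypothetical. The dataframe column could be passed in as a plain list (for example via `.tolist()`), so the sketch itself does not depend on Pandas. Whether cTAKES then treats each line as its own sentence depends on the sentence detector configured in the piper file.

```python
from pathlib import Path


def write_surgery_list(surgeries, input_path):
    """Write one surgery description per line; return the number of lines."""
    lines = []
    for surgery in surgeries:
        # Collapse internal whitespace (including stray newlines) so that
        # every dataframe row occupies exactly one line of the input file.
        lines.append(" ".join(str(surgery).split()))
    text = "".join(line + "\n" for line in lines)
    Path(input_path).write_text(text, encoding="utf-8")
    return len(lines)
```

Because line N of the file corresponds to row N of the list that was passed in, any per-line output can be matched back to the dataframe by position.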

Regarding my coding background, I don't have much background in Java.
However, a few years ago, I had no knowledge of Python either, but I was
able to teach myself while in medical school.

A few more questions for you...
1.) Should I save the code you posted in the following location as a .jar
file?
C:\cTAKES_4.0.0\lib\SentenceFirstCuiWriter.jar

2.) Should I replace "add CuiLookupLister" with "add
SentenceFirstCuiWriter" in the piper file or do I need both?

3.) If the SentenceFirstCuiWriter is unable to find a valid CUI, will it
leave a blank, N/A, or NaN value? Having any of these values would
definitely help when I have Python parse the output text file. When I have
Python read the output text file, I would have it delete any dataframe rows
with NaN or N/A in the CUI column.
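
On the Python side, the filtering described in question 3 could be sketched as below. One caveat from reading the posted code: as written, the writer only writes a line when the CUI is non-empty, so a surgery without a CUI would produce no line at all rather than a blank or N/A. The sketch therefore assumes a hypothetical variant of the writer that emits one line per surgery, with some placeholder when no CUI is found. The function name is also hypothetical.

```python
import re

# A UMLS CUI is the letter C followed by seven digits, e.g. C0009378.
CUI_PATTERN = re.compile(r"C\d{7}")


def pair_cuis(surgeries, cui_lines):
    """Pair each surgery with its CUI line and drop rows without a valid CUI.

    Assumes one output line per surgery (an assumption, see above).
    Anything that is not a well-formed CUI (blank, N/A, NaN, ...) is
    treated as "no CUI found" and the row is dropped.
    """
    surgeries = list(surgeries)
    cuis = [line.strip() for line in cui_lines]
    if len(cuis) != len(surgeries):
        # Positional matching is only safe when the counts agree.
        raise ValueError(
            f"{len(surgeries)} surgeries but {len(cuis)} CUI lines"
        )
    return [
        (surgery, cui)
        for surgery, cui in zip(surgeries, cuis)
        if CUI_PATTERN.fullmatch(cui)
    ]
```

Raising on a count mismatch is a deliberate choice: if the writer skips lines, pairing by position would silently attach CUIs to the wrong surgeries.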

Thank you very much for your assistance!

Ryan Young
MD/MBA Candidate
Jacobs School of Medicine & Biomedical Sciences

On Mon, Mar 23, 2020 at 1:01 PM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Ryan,
>
> Here is some code for a writer that will do what you want.
> To use it, get rid of those first two lines in the piper that I sent (set,
> reader).
> The default reader will work just fine, and it will allow you to process
> multiple surgery lists in one run.
>
> Then just add SentenceFirstCuiWriter to the end of your piper.
>
> Sean
>
>
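> // Standard-library imports. The cTAKES and UIMA classes used below
> // (AbstractJCasFileWriter, JCas, JCasUtil, Sentence, ProcedureMention,
> // OntologyConceptUtil, DefaultAspanComparator) must be imported as well.
> import java.io.BufferedWriter;
> import java.io.File;
> import java.io.FileWriter;
> import java.io.IOException;
> import java.io.Writer;
> import java.util.Collection;
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
> import java.util.stream.Collectors;
>
> public class SentenceFirstCuiWriter extends AbstractJCasFileWriter {
>
>    @Override
>    public void writeFile( final JCas jCas, final String outputDir,
>                           final String documentId, final String fileName ) throws IOException {
>       File cuiFile = new File( outputDir, fileName + "_cui.txt" );
>       // Group the procedure mentions by the sentence that covers them.
>       Map<Sentence, Collection<ProcedureMention>> sentenceMap
>             = JCasUtil.indexCovered( jCas, Sentence.class, ProcedureMention.class );
>       // Order the groups by their covering sentence.
>       List<Collection<ProcedureMention>> sortedSentenceProcedures
>             = sentenceMap.entrySet()
>                          .stream()
>                          .sorted( Map.Entry.comparingByKey( DefaultAspanComparator.INSTANCE ) )
>                          .map( Map.Entry::getValue )
>                          .collect( Collectors.toList() );
>       try ( Writer writer = new BufferedWriter( new FileWriter( cuiFile ) ) ) {
>          for ( Collection<ProcedureMention> procedures : sortedSentenceProcedures ) {
>             // The first procedure is the one that begins earliest in the sentence.
>             ProcedureMention firstProcedure
>                   = procedures.stream()
>                               .min( Comparator.comparingInt( ProcedureMention::getBegin ) )
>                               .orElse( null );
>             if ( firstProcedure != null ) {
>                String cui
>                      = OntologyConceptUtil.getCuis( firstProcedure )
>                                           .stream()
>                                           .findFirst()
>                                           .orElse( "" );
>                // A sentence without a CUI produces no output line.
>                if ( !cui.isEmpty() ) {
>                   writer.write( cui + "\n" );
>                }
>             }
>          }
>       }
>    }
> }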
>
> ________________________________________
> From: Ryan Young <royo...@buffalo.edu>
> Sent: Monday, March 23, 2020 11:02 AM
> To: dev@ctakes.apache.org
> Subject: Configure Fast Lookup Dictionary To Return Only 1 UMLS Code (CUI)
> [EXTERNAL]
>
>
> Hello,
>
> I am a medical student who happened to come across cTAKES for a project I
> am working on. What I would like to do is take a list of surgeries in a
> text file and have cTAKES output what it determines to be the best UMLS
> code (CUI) for that particular line.
>
> Each line of the text file is independent of the others (i.e., each line
> should be read and analyzed separately). For example, here's my list of the
> surgeries (Surgery_List.txt):
> Colonoscopy with Polypectomy
> Esophagogastroduodenoscopy Colonoscopy
> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
>
> When I run the piper file (see below), I get the following output:
> Colonoscopy with Polypectomy
> "Colonoscopy"
>   Procedure
>   C0009378 colonoscopy
> "Polypectomy"
>   Procedure
>   C0521210 Resection of polyp
>
> Esophagogastroduodenoscopy Colonoscopy
> "Esophagogastroduodenoscopy"
>   Procedure
>   C0079304 Esophagogastroduodenoscopy
> "Colonoscopy"
>   Procedure
>   C0009378 colonoscopy
>
> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
> "Esophagogastroduodenoscopy"
>   Procedure
>   C0079304 Esophagogastroduodenoscopy
> "Endoscopic ultrasound"
>   Procedure
>   C0376443 Endoscopic Ultrasound
> "Endoscopic"
>   Procedure
>   C0014245 Endoscopy (procedure)
> "ultrasound"
>   Procedure
>   C0041618 Ultrasonography
> "Fine needle aspiration"
>   Procedure
>   C1510483 Fine needle aspiration biopsy
> "aspiration"
>   Procedure
>   C0349707 Aspiration-action
>
> Here's the piper file I have been using:
> reader org.apache.ctakes.core.cr.FileTreeReader InputDirectory="C:\path\to\input\folder"
> load DefaultTokenizerPipeline.piper SentenceModelFile=C:\cTAKES_4.0.0\desc\ctakes-core\desc\analysis_engine\SentenceDetectorAnnotatorBIO.xml
> add ContextDependentTokenizerAnnotator
> add org.apache.ctakes.necontexts.ContextAnnotator
> addDescription POSTagger
> load ChunkerSubPipe.piper
> set ctakes.umlsuser=my_username ctakes.umlspw=my_password
> add org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator DictionaryDescriptor=C:\cTAKES_4.0.0\desc\ctakes-dictionary-lookup-fast\desc\analysis_engine\UmlsLookupAnnotator.xml LookupXml=C:\cTAKES_4.0.0\resources\org\apache\ctakes\dictionary\lookup\fast\sno_rx_16ab.xml
> add property.plaintext.PropertyTextWriterFit OutputDirectory="C:\path\to\output\folder"
>
> The workaround I have developed is as follows...
> 1.) Save each line of Surgery_List.txt to separate text files
> 2.) Use a Python script to parse each individual text file to extract the
> first UMLS code (CUI) given in the text file
>
> The above method works fine when there are only 10 lines, but not so well
> when there are 40,000 lines in Surgery_List.txt.
>
> Ideally, I would like for Fast Dictionary Lookup to just return the top
> result for each line of Surgery_List.txt. For example, Output.txt would
> look just like this:
> C0009378
> C0079304
> C0079304
>
> Just for reference here's how UMLS codes correspond between
> Surgery_List.txt and Output.txt:
> C0009378 --> Colonoscopy with Polypectomy
> C0079304 --> Esophagogastroduodenoscopy Colonoscopy
> C0079304 --> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
>
> Is there something I can add to the piper file to make this happen?
>
> Currently, I have the cTAKES user version installed, but I could install
> the developer version if need be. I would just need a little guidance on
> which Java class I would need to modify to get the desired results.
>
> Thank You,
>
> Ryan Young
> MD/MBA Candidate
> Jacobs School of Medicine & Biomedical Sciences
>
