Hello Sean,

Wow! This was a lot more than I was anticipating! Thank you very much!

To answer your questions...
• I am using Windows 10
• I have the Python script call a shell command to run a batch file. The batch file just contains the following line:
  "C:\cTAKES_4.0.0\bin\runPiperFile.bat" -p "C:\path\to\piper.piper"
• The Python script waits for the shell command to complete (i.e., when cTAKES is finished processing)
• The Python script will then parse the output text files and then continue on with the code

Prior to calling cTAKES, the surgery list is in a Pandas dataframe. The workaround I had created was to save each line of the surgery list column in the dataframe to a different text file to make it easier for when I had to parse the output cTAKES text file. As I had mentioned previously, I would like to have just 1 input text file and 1 output text file (as long as the output file can be easily parsed by Python).

Regarding my coding background, I don't have much background in Java. However, a few years ago, I had no knowledge of Python either, but I was able to teach myself while in medical school.

A few more questions for you...
1.) Should I save the code you posted in the following location as a .jar file?
    C:\cTAKES_4.0.0\lib\SentenceFirstCuiWriter.jar
2.) Should I replace "add CuiLookupLister" with "add SentenceFirstCuiWriter" in the piper file or do I need both?
3.) If the SentenceFirstCuiWriter is unable to find a valid CUI, will it leave a blank, N/A, or NaN value? Having any of these values would definitely help when I have Python parse the output text file. When I have Python read the output text file, I would have it delete any dataframe rows with NaN or N/A in the CUI column.

Thank you very much for your assistance!

Ryan Young
MD/MBA Candidate
Jacobs School of Medicine & Biomedical Sciences

On Mon, Mar 23, 2020 at 1:01 PM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

> Hi Ryan,
>
> Here is some code for a writer that will do what you want.
> To use it, get rid of those first two lines in the piper that I sent (set, reader).
> The default reader will work just fine, and it will allow you to process
> multiple surgery lists in one run.
>
> Then just add SentenceFirstCuiWriter to the end of your piper.
>
> Sean
>
>
> public class SentenceFirstCuiWriter extends AbstractJCasFileWriter {
>
>    public void writeFile( final JCas jCas, final String outputDir,
>                           final String documentId, final String fileName )
>          throws IOException {
>       File cuiFile = new File( outputDir, fileName + "_cui.txt" );
>       // Index the procedure mentions covered by each sentence.
>       Map<Sentence, Collection<ProcedureMention>> sentenceMap
>             = JCasUtil.indexCovered( jCas, Sentence.class, ProcedureMention.class );
>       // Order the per-sentence procedure collections by sentence position.
>       List<Collection<ProcedureMention>> sortedSentenceProcedures
>             = sentenceMap.entrySet()
>                          .stream()
>                          .sorted( Map.Entry.comparingByKey( DefaultAspanComparator.INSTANCE ) )
>                          .map( Map.Entry::getValue )
>                          .collect( Collectors.toList() );
>       try ( Writer writer = new BufferedWriter( new FileWriter( cuiFile ) ) ) {
>          for ( Collection<ProcedureMention> procedures : sortedSentenceProcedures ) {
>             // Take the procedure mention that starts earliest in the sentence.
>             ProcedureMention firstProcedure
>                   = procedures.stream()
>                               .min( Comparator.comparingInt( ProcedureMention::getBegin ) )
>                               .orElse( null );
>             if ( firstProcedure != null ) {
>                String cui
>                      = OntologyConceptUtil.getCuis( firstProcedure )
>                                           .stream()
>                                           .findFirst()
>                                           .orElse( "" );
>                // Only non-empty CUIs are written; otherwise no line is emitted.
>                if ( !cui.isEmpty() ) {
>                   writer.write( cui + "\n" );
>                }
>             }
>          }
>       }
>    }
> }
>
> ________________________________________
> From: Ryan Young <royo...@buffalo.edu>
> Sent: Monday, March 23, 2020 11:02 AM
> To: dev@ctakes.apache.org
> Subject: Configure Fast Lookup Dictionary To Return Only 1 UMLS Code (CUI) [EXTERNAL]
>
> Hello,
>
> I am a medical student who happened to come across cTAKES for a project I
> am working on.
> What I would like to do is take a list of surgeries in a
> text file and have cTAKES output what it determines to be the best UMLS
> code (CUI) for that particular line.
>
> Each line of the text file is independent of the others (i.e., each line
> should be read and analyzed separately). For example, here's my list of the
> surgeries (Surgery_List.txt):
> Colonoscopy with Polypectomy
> Esophagogastroduodenoscopy Colonoscopy
> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
>
> When I run the piper file (see below), I get the following output:
> Colonoscopy with Polypectomy
> "Colonoscopy"
> Procedure
> C0009378 colonoscopy
> "Polypectomy"
> Procedure
> C0521210 Resection of polyp
>
> Esophagogastroduodenoscopy Colonoscopy
> "Esophagogastroduodenoscopy"
> Procedure
> C0079304 Esophagogastroduodenoscopy
> "Colonoscopy"
> Procedure
> C0009378 colonoscopy
>
> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
> "Esophagogastroduodenoscopy"
> Procedure
> C0079304 Esophagogastroduodenoscopy
> "Endoscopic ultrasound"
> Procedure
> C0376443 Endoscopic Ultrasound
> "Endoscopic"
> Procedure
> C0014245 Endoscopy (procedure)
> "ultrasound"
> Procedure
> C0041618 Ultrasonography
> "Fine needle aspiration"
> Procedure
> C1510483 Fine needle aspiration biopsy
> "aspiration"
> Procedure
> C0349707 Aspiration-action
>
> Here's the piper file I have been using:
> reader org.apache.ctakes.core.cr.FileTreeReader InputDirectory="C:\path\to\input\folder"
> load DefaultTokenizerPipeline.piper
> SentenceModelFile=C:\cTAKES_4.0.0\desc\ctakes-core\desc\analysis_engine\SentenceDetectorAnnotatorBIO.xml
> add ContextDependentTokenizerAnnotator
> add org.apache.ctakes.necontexts.ContextAnnotator
> addDescription POSTagger
> load ChunkerSubPipe.piper
> set ctakes.umlsuser=my_username ctakes.umlspw=my_password
> add org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator
> DictionaryDescriptor=C:\cTAKES_4.0.0\desc\ctakes-dictionary-lookup-fast\desc\analysis_engine\UmlsLookupAnnotator.xml
> LookupXml=C:\cTAKES_4.0.0\resources\org\apache\ctakes\dictionary\lookup\fast\sno_rx_16ab.xml
> add property.plaintext.PropertyTextWriterFit OutputDirectory="C:\path\to\output\folder"
>
> The workaround I have developed is as follows...
> 1.) Save each line of Surgery_List.txt to separate text files
> 2.) Use a Python script to parse each individual text file to extract the
> first UMLS code (CUI) given in the text file
>
> The above method works fine when there's only 10 lines, but not so well
> when there's 40,000 lines in Surgery_List.txt.
>
> Ideally, I would like for Fast Dictionary Lookup to just return the top
> result for each line of Surgery_List.txt. For example, Output.txt would
> look just like this:
> C0009378
> C0079304
> C0079304
>
> Just for reference here's how UMLS codes correspond between
> Surgery_List.txt and Output.txt:
> C0009378 --> Colonoscopy with Polypectomy
> C0079304 --> Esophagogastroduodenoscopy Colonoscopy
> C0079304 --> Esophagogastroduodenoscopy with Endoscopic ultrasound Fine needle aspiration
>
> Is there something I can add to the piper file to make this happen?
>
> Currently, I have the cTAKES user version installed, but I could install
> the developer version if need be. I would just need a little guidance on
> which Java script I would need to modify to get the desired results.
>
> Thank You,
>
> Ryan Young
> MD/MBA Candidate
> Jacobs School of Medicine & Biomedical Sciences
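
For reference, the Python parsing step described in the original message could be done on a single output file along these lines. This is only a rough sketch, not a tested cTAKES integration. It assumes the output looks like the listing pasted above: one block per surgery line, blocks separated by a blank line, the first line of each block being the original surgery text, and each CUI at the start of its own line. The function name first_cui_per_block and the SAMPLE text are made up for illustration.

```python
import re

# UMLS CUIs are written as "C" followed by seven digits, e.g. C0009378.
CUI_PATTERN = re.compile(r"^(C\d{7})\b")


def first_cui_per_block(text):
    """Return a list of (surgery_line, first_cui_or_None), one per block.

    Assumption: blocks are separated by blank lines and the first line of
    each block is the original surgery text, as in the pasted output.
    """
    results = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        surgery = lines[0]
        cui = None  # stays None when the block contains no CUI line
        for line in lines[1:]:
            match = CUI_PATTERN.match(line)
            if match:
                cui = match.group(1)
                break
        results.append((surgery, cui))
    return results


# Hypothetical sample text mirroring the first two blocks of the listing.
SAMPLE = """\
Colonoscopy with Polypectomy
"Colonoscopy"
Procedure
C0009378 colonoscopy
"Polypectomy"
Procedure
C0521210 Resection of polyp

Esophagogastroduodenoscopy Colonoscopy
"Esophagogastroduodenoscopy"
Procedure
C0079304 Esophagogastroduodenoscopy
"Colonoscopy"
Procedure
C0009378 colonoscopy
"""

if __name__ == "__main__":
    for surgery, cui in first_cui_per_block(SAMPLE):
        print(cui, "-->", surgery)
```

Because each result keeps the surgery text next to its CUI (or None), the pairs can be loaded back into a dataframe and rows without a CUI dropped, without relying on line positions matching between the input and output files.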