Paul asks:

> I am looking for a NLP to read pathology reports and extract cancer
> related site, histology, stage and any other DX/RX data available. In
> looking at CTakes, I have a few questions;
>
> - Is CTakes an appropriate tool to automate this task?

I wrote a commercial surgical-pathology coding module some years ago, and could imagine doing it in cTAKES. Here's my two cents to add to the wealth of information Peter has already provided. Best luck.

> Where can I find an "executive overview" (30,000 foot view) of how the CTakes works?

As Peter said, there's a lot of documentation out there! Videos here:
https://ctakes.apache.org/tutorials.html

Key point: it's built on top of UIMA (https://uima.apache.org/), which ingests and annotates data from any source, letting you mix, match and create your own annotators to build chains of analyses. The cTAKES value-adds include a clinical type system and a spiffy dictionary (see below).

> My ignorance regarding NLP algorithms like CTakes is whether it is keyword driven, or it is self learning.

cTAKES is *not* "self-learning"; you have to tell it exactly what information you want to extract from where.

Pro: High precision; explainable; you won't get the right answer for the wrong reason.
Con: Low recall; brittle; you may not get answers at all! If you're processing unpredictable document formats from many different facilities, it can be hard to generalize over them.

> I currently have a homegrown application which looks for keywords and negation modifiers within a certain distance from the keywords

cTAKES can certainly help with that.

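For illustration only, here is a minimal Python sketch of the "keywords plus negation within a certain distance" idea, with one refinement: the negation window is clipped at the sentence boundary. This is not cTAKES code and uses no cTAKES or UIMA API; the keyword list, the negation cues, the naive sentence splitter and the function names are all assumptions made up for the example.

```python
import re

# Hypothetical toy vocabularies; a real system would use a dictionary
# such as one built from UMLS sources.
KEYWORDS = ("carcinoma", "adenocarcinoma", "melanoma")
NEGATION_CUES = ("no", "not", "without", "negative for", "no evidence of")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text, window=5):
    """Return (keyword, negated) pairs.

    A keyword counts as negated when a negation cue occurs among the
    `window` tokens before it *within the same sentence*.
    """
    results = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token not in KEYWORDS:
                continue
            preceding = " ".join(tokens[max(0, i - window):i])
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", preceding)
                for cue in NEGATION_CUES
            )
            results.append((token, negated))
    return results


if __name__ == "__main__":
    report = "Margins are negative for tumor. Adenocarcinoma identified."
    print(find_mentions(report))
```

With a plain five-token window and no sentence splitting, "negative for" in the first sentence would wrongly negate "adenocarcinoma" in the second; scoping the window to the sentence avoids that particular error, though a regex splitter like this one will still trip over abbreviations and table-like text.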
- *Keywords*
  cTAKES lets you use the NLM's UMLS Metathesaurus
  <https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html>,
  using the dictionary framework Peter mentioned:
  https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Fast+Dictionary+Lookup
  These sources may be useful in building your custom dictionary:
  - the NCI Thesaurus:
    https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NCI/index.html
  - CPT, if you want codes from there:
    https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CPT/index.html
  - For anatomy, I'm not familiar with the "anatomical site annotator" Peter alludes to, but the FMA is better structured than SNOMED:
    https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/FMA/index.html

- *Negation*
  Several annotators available:
  https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Negation+Annotators
  Distance-from-keywords is a start, but sentence detection and shallow parsing both help. I like the ctakes-ytex-uima NegexAnnotator and SentenceDetector.

- *Document structure*
  I found header detection to be crucial in processing pathology reports: tracking specimens through a document, extracting tumor info from tables, etc. The cTAKES RegexSectionizer might work for you.
  https://ctakes.apache.org/apidocs/4.0.0/org/apache/ctakes/core/ae/RegexSectionizer.html

_____________________________________________________
*Kean Kaufmann*
NLP Architect
RecordsOne
nSight Driven | *Priority. Clarity. Integrity.*

On Thu, Aug 10, 2023 at 1:06 PM Paul Stearns <pa...@compuace.com.invalid> wrote:

> I am looking for a NLP to read pathology reports and extract cancer
> related site, histology, stage and any other DX/RX data available. In
> looking at CTakes, I have a few questions;
>
> - Is CTakes an appropriate tool to automate this task?
> - The end goal would be a fully automated tool where text was presented to
> an API and data was returned.
> - An added bonus, would be for the tool to annotate the text, so that a
> reviewer can more easily find the relevant data.
> - For someone with a strong IT/software development background, but no NLP
> background what is the level of difficulty in getting started with this
> product?
>
> Paul R. Stearns
> Advanced Consulting Enterprises, Inc.
> 15150 NW 79th Court,
> Suite: 206
> Miami Lakes Fl, 33016
>
> Voice: (305)623-0360 x107
> Fax: (305)623-4588
>