Re: periods and the interaction with PTB & Fast Dict Lookup.

2015-07-17 Thread britt fitch
Hi Sean, do you want a ticket for the PTB update?

Cheers,

Britt



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 15, 2015, at 9:07 AM, britt fitch wrote:
> 
> Thanks Sean.
> 
> The other part of the concern is whether it’s reasonable/feasible to alter 
> tokenization rules for things like gene locations. I can work around this in 
> a few ways, but if this comes up in other cases it could be worth looking at 
> a blanket change. Sadly I don’t have another example off the top of my head; 
> maybe organism names? From a few queries for UMLS terms containing periods, 
> the majority seem to be things you really would want to split on. Perhaps 
> genes are just an edge case.
> 
> I was looking at gene locations overall, not any particular gene or disorder 
> grouping. The term I mentioned was just meant to be an example.
> 
> 
> 
>> On Jul 15, 2015, at 8:57 AM, Finan, Sean wrote:
>> 
>> Hi Britt,
>> 
>> The dictionary should be using PTB tokenization, but I obviously missed a 
>> rule and separated the "." from the following "2" in the dictionary.
>> 
>> I will double-check everything.
>> 
>> Sean
>> 
>> p.s. if you don’t mind my asking, are you looking into all connective tissue 
>> disorders or just Shprintzen?
>> 
>> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
>> Sent: Tuesday, July 14, 2015 3:58 PM
>> To: dev@ctakes.apache.org 
>> Subject: periods and the interaction with PTB & Fast Dict Lookup.
>> 
>> Another question/topic likely for Sean & Tim. Happy to get others’ feedback 
>> as well.
>> 
>> I am trying to identify gene related information.
>> 
>> It appears that the PTB tokenization logic in places like the tokenizer & 
>> dictionary building will split a string into multiple tokens if it is not a 
>> number and contains a period.
>> 
>> For example, given “22q11.2 deletion syndrome”:
>> 
>> PTB tokenizer: [22q11, .2, deletion, syndrome]
>> POS for the above term: [CD, CD, NN, NN]
>> Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]
>> 
>> The same string creates a different split of [22q11, ., 2, deletion, 
>> syndrome] in the new dictionary module (RareWordTermMapCreator.getTokens)
>> When the _rareWordTermMap gets created it uses the first token as the key: 
>> 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
>> 
>> The period-split difference above (period alone vs period + number) might be 
>> irrelevant here because for the input “22q11.2 deletion syndrome”, the 
>> lookup indices are [2,3].
>> The new lookup will ignore incoming tokens “22q11” because it’s CD and “.2” 
>> because it’s a number.
>> 
>> It looks like this concept cannot be identified unless CD is allowed as a 
>> lookup token POS.
>> Even if it is allowed, though, I think the PTB rules might not be the best 
>> fit for gene locations.
>> 
>> Are there any thoughts/experiences on handling gene location mentions 
>> like this?
>> Should the Fast Dict tokenization logic match the PTB tokenizer logic to 
>> produce the same components?
>> 
>> Let me know if I have misread any of these points. Since these items 
>> would likely cause large changes, I am looking for some feedback before 
>> moving forward.
>> 
>> Cheers,
>> 
>> Britt
>> 
> 
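An illustrative sketch of the mismatch described in the original message. The two splitter functions below are hypothetical stand-ins, written only to reproduce the two token lists quoted in the thread for this one example; they are not the cTAKES PTB tokenizer or RareWordTermMapCreator.getTokens. The sketch shows why a dictionary term split one way cannot be matched token-for-token against note text split the other way.

```python
import re

def split_keep_decimal(text):
    """Hypothetical rule: a period followed by digits stays attached (".2")."""
    return re.findall(r"\.\d+|[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def split_every_period(text):
    """Hypothetical rule: every period becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def contains_sequence(tokens, term_tokens):
    """True if term_tokens occurs as a contiguous run inside tokens."""
    n = len(term_tokens)
    return any(tokens[i:i + n] == term_tokens
               for i in range(len(tokens) - n + 1))

term = "22q11.2 deletion syndrome"
note = "Patient has 22q11.2 deletion syndrome."

note_tokens = split_keep_decimal(note)

# Term tokenized with a different rule than the note: no match.
print(contains_sequence(note_tokens, split_every_period(term)))
# Term tokenized with the same rule as the note: match.
print(contains_sequence(note_tokens, split_keep_decimal(term)))
```

Under these assumptions the first check fails and the second succeeds, which is the argument for having the dictionary builder and the pipeline tokenizer agree on how periods are split.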





How to Add the resources as a folder to the classpath? - Compilereleasefromcommandline.

2015-07-17 Thread Justin Zhang
Could anybody share some information on how to add the resources folder to the
classpath? Thanks,

https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+Developer+Install+Guide#cTAKES3.0DeveloperInstallGuide-Compileareleasefromcommandline


5. Add the resources as a folder to the classpath.
Make sure the current path or dot (.) is in your CLASSPATH environment
variable accessible to the process maven is running in.

-- 
Justin


RE: Allergy Annotator

2015-07-17 Thread Tomasz Oliwa
Hi,

I am interested in the design decision of the sentence detector. 

Why does it split a sentence of the form "WORD1: WORD2 WORD3." into two 
sentences, "WORD1:" and "WORD2 WORD3."? Do other components of cTAKES require 
such sentence splitting?

It would seem to me that it should remain one sentence. For example, the 
smoking status detector has its own SentenceAdjuster that merges some such 
sentences back into one because of this design.

Thanks,
Tomasz
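
A minimal sketch of the kind of merge described above, assuming sentences arrive as plain strings in document order. The function name merge_colon_sentences is hypothetical, and this is not the logic of the smoking status SentenceAdjuster, which is not shown in this thread; it only illustrates rejoining a "KEY:" fragment with the sentence that follows it.

```python
def merge_colon_sentences(sentences):
    """Join any sentence that ends with ':' to the sentence after it."""
    merged = []
    for sentence in sentences:
        if merged and merged[-1].rstrip().endswith(":"):
            # Previous fragment was a header such as "ALLERGIES:"; append to it.
            merged[-1] = merged[-1].rstrip() + " " + sentence.lstrip()
        else:
            merged.append(sentence)
    return merged

print(merge_colon_sentences(["WORD1:", "WORD2 WORD3."]))
print(merge_colon_sentences(["ALLERGIES:", "PENICILLIN, WHEAT", "No fever."]))
```

A real annotator would work on sentence annotations with begin/end offsets rather than strings, so the merge would extend the first annotation's end and remove the second.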


From: Finan, Sean [sean.fi...@childrens.harvard.edu]
Sent: Friday, July 10, 2015 3:20 PM
To: dev@ctakes.apache.org
Subject: RE: Allergy Annotator

Hi Tom,

It is exactly because the sentence detector splits "KEY:" from "VALUE" that I 
didn't suggest using sentences.  Instead, I would just iterate over the whole 
CAS collection of medication events and attempt to match allergy phrases 
("allergic to medication") against the note text spanning from event.begin-15 to 
event.end+15, or whatever window size you prefer.

Sean
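
A rough sketch of the window approach suggested above, under several assumptions: medication events are represented as (begin, end) character offsets with an exclusive end, the allergy cue is a simple hypothetical regex, and allergy_mentions is an illustrative name rather than a cTAKES class or method.

```python
import re

# Hypothetical cue pattern; a real rule set would be specified beforehand.
ALLERGY_CUE = re.compile(r"\ballerg(?:y|ies|ic)\b", re.IGNORECASE)

def allergy_mentions(note_text, medication_spans, window=15):
    """Return the medication spans that have an allergy cue within
    `window` characters before begin or after end."""
    hits = []
    for begin, end in medication_spans:
        start = max(0, begin - window)
        stop = min(len(note_text), end + window)
        if ALLERGY_CUE.search(note_text[start:stop]):
            hits.append((begin, end))
    return hits

note = "ALLERGIES:  PENICILLIN, WHEAT. Started aspirin daily for prevention."

def span_of(word):
    # Helper for this example only: locate a word's offsets in the note.
    begin = note.index(word)
    return (begin, begin + len(word))

spans = [span_of("PENICILLIN"), span_of("aspirin")]
print(allergy_mentions(note, spans))
```

With the default window only the PENICILLIN span is returned here. Items further down a long list can fall outside a 15-character window, so the window size is a trade-off between recall and false matches.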

-Original Message-
From: Tom Devel [mailto:deve...@gmail.com]
Sent: Friday, July 10, 2015 4:12 PM
To: dev@ctakes.apache.org
Subject: Re: Allergy Annotator

Sean and Dima, these are great suggestions, thanks so far.

Sean, when looping over medication events as you say, I can see how it is 
possible to take the textspan.Sentence of this MedicationMention, and then do a 
regex check for the phrase structure as Dima said.

But instead of textspan.Sentence, you mention "see if any is included in a 
phrase". Which cTAKES/UIMA class is related to this?

If I used textspan.Sentence, it would work for "The patient is 
allergic to penicillin.", but cTAKES splits "ALLERGIES:  PENICILLIN, WHEAT"
into two sentences, so the MedicationMentions here would not be in the 
same sentence as the word "ALLERGIES".

Thanks again,
Tom


On Fri, Jul 10, 2015 at 2:12 PM, Finan, Sean < 
sean.fi...@childrens.harvard.edu> wrote:

> Hi Dima, Tom,
>
> I was thinking the same as Dima's first solution.  Iterate through the
> medication events and see if any is included in a phrase as mentioned in
> Tom's original email.  Each phrase structure would have to be
> specified beforehand.  However, assigning appropriate CUIs would
> require having a lookup table for each medication allergy.  I think
> that would be the simplest solution.
>
> Sean
>
> -Original Message-
> From: Dligach, Dmitriy [mailto:dmitriy.dlig...@childrens.harvard.edu]
> Sent: Friday, July 10, 2015 2:50 PM
> To: cTAKES Developer list
> Subject: Re: Allergy Annotator
>
> Hi Tom,
>
> If the patterns are pretty simple, you could just add a few rules on
> top of the cTAKES dictionary lookup output. Something of the kind
> "allergic to [medication]" or "allergies: [medication],
> [medication], [medication], ...".
>
> If these patterns are hard to express as rules, you should consider a
> machine learning based sequence labeling route (e.g. something similar
> to the cTAKES chunker).
>
>
> Dima
>
> --
> Dmitriy (Dima) Dligach, Ph.D.
> Boston Children's Hospital and Harvard Medical School
> (617) 651-0397
>
>
>
> On Jul 10, 2015, at 13:40, Tom Devel <deve...@gmail.com> wrote:
>
> Sean,
>
> It would be a wider net, such that if an allergy is mentioned in the
> clinical note, this is captured in the corresponding
> IdentifiedAnnotation (or alternatively, if the IdentifiedAnnotation
> class should not be changed with a new attribute, in a separate allergy 
> annotation).
>
> This annotator would then have to of course run after the clinical
> pipeline has run and discovered all IdentifiedAnnotations.
>
> I am familiar with writing UIMA/cTAKES annotators, but not sure how a
> new ML method could be integrated here for detecting allergies. Do you
> have any thoughts about how to approach this in general?
>
> Thanks,
> Tom
>
> On Fri, Jul 10, 2015 at 11:54 AM, Finan, Sean <
> sean.fi...@childrens.harvard.edu>
> wrote:
>
> Hi Tom,
>
> Are you interested in catching all allergies or just a few specific
> allergies for a study?  If you are only concerned with a few then
> there is a (possibly) simple solution.  If you are interested in
> throwing a wider net then I think that a new module would need to be
> created; does anybody reading this have an ML or regex style module?
>
> Sean
>
> -Original Message-
> From: Tom Devel [mailto:deve...@gmail.com]
> Sent: Friday, July 10, 2015 12:42 PM
> To: dev@ctakes.apache.org
> Subject: Allergy Annotator
>
> Hi,
>
> I would like to use/extend cTAKES to detect allergies.
>
> In the cTAKES publication (2010)
>
>
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/
> there is the mention
> that: "Allergies to a given medicati

Resources for current Version

2015-07-17 Thread Braun, Florian A
Hello,

I am trying to set up the source code and am following the developer install 
guide. When I get to the part about the resources, which resources should I be 
using: the 3.2.0 or the 3.2.1.1-bin? I would think the -bin ones are compiled?

Also, I am getting an XML error for ytex:
Element type "hibernate-mapping" must be followed by either attribute 
specifications, ">" or "/>".
UMLS.hbm.template.xml 
/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/umls/model 
line 27, XML Problem


Re: Resources for current Version

2015-07-17 Thread Chen, Pei
Florian,
If you're building it from source, all of the resources should be automatically 
downloaded for you via maven. 

Sent from my iPhone

> On Jul 17, 2015, at 4:11 PM, Braun, Florian A wrote:
> 
> Hello,
> 
> I am trying to set up the source code and am following the developer install 
> guide. When I get to the part about the resources, which resources should I be 
> using: the 3.2.0 or the 3.2.1.1-bin? I would think the -bin ones are compiled?
> 
> Also, I am getting an xml error for ytex:
> Element type "hibernate-mapping" must be followed by either attribute 
> specifications, ">" or "/>".UMLS.hbm.template.xml 
> /ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/umls/model   
> line 27   XML Problem