RE: ctakes umlsuserapprover authentication error

2015-07-15 Thread Finan, Sean
Hi Stuart,

That is strange.  Both dictionaries validate the username in the same way, so 
if it works for one it should work for the other.  I am guessing that this 
problem is repeatable, but I have to ask if you've tested the different 
dictionaries on different machines or different locations, i.e. at work vs. at 
home?

Sean

-----Original Message-----
From: Taylor, Stuart [mailto:sxt127...@utdallas.edu] 
Sent: Tuesday, July 14, 2015 7:43 PM
To: dev@ctakes.apache.org
Subject: ctakes umlsuserapprover authentication error

Hello,

I have installed cTAKES 3.2.2 using the instructions linked to on the download 
page, but I am currently receiving the following error when I run 
runctakesCVD.sh and try to load AggregatePlaintextFastUMLSProcessor.xml:


14 Jul 2015 16:01:43 ERROR UmlsUserApprover - UMLS Account at 
https://uts.nlm.nih.gov/restful/isValidUMLSUser is not valid for user 
my_username with my_password


where I replaced my actual username with my_username, and my actual password 
with my_password. I verified the information by logging into the UMLS website 
by copy/pasting the username/password from the error message into the login 
fields.

When poking around to see if anyone else had this problem I noticed that it was 
an issue with version 3.2.1, but that it had been fixed in version 3.2.2.

I can load and run AggregatePlaintextUMLSProcessor.xml without any errors.

In case it is relevant, my UMLS license got approved earlier today.


RE: periods and the interaction with PTB & Fast Dict Lookup.

2015-07-15 Thread Finan, Sean
Hi Britt,

The dictionary should be using ptb tokenization, but I obviously missed a rule 
and separated the . from the following 2 in the dictionary.

I will double-check everything.

Sean

p.s. if you don’t mind my asking, are you looking into all connective tissue 
disorders or just Shprintzen?

From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
Sent: Tuesday, July 14, 2015 3:58 PM
To: dev@ctakes.apache.org
Subject: periods and the interaction with PTB & Fast Dict Lookup.

Another question/topic likely for Sean & Tim. Happy to get others’ feedback as 
well.

I am trying to identify gene related information.

It appears that the PTB tokenization logic in places like the tokenizer & 
dictionary building will split a string into multiple tokens if it is not a 
number and contains a period.

For example, given “22q11.2 deletion syndrome”:

PTB tokenizer: [22q11, .2, deletion, syndrome]
POS for the above term: [CD, CD, NN, NN]
Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string creates a different split of [22q11, ., 2, deletion, syndrome] 
in the new dictionary module (RareWordTermMapCreator.getTokens)
When the _rareWordTermMap gets created it uses the first token as the key: 
22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
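
The two splits described above can be reproduced with a small sketch. This is 
not the cTAKES code: the function names and the number test below are 
hypothetical stand-ins, written only to mirror the behaviour reported in this 
message, and only the first period in a word is handled.

```python
import re

# Hypothetical number test; the real PTB and dictionary rules are more involved.
_NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)")


def is_number(token):
    """Return True for plain integers and decimals such as 2, 11.2 or .2."""
    return _NUMBER.fullmatch(token) is not None


def split_like_ptb_tokenizer(token):
    """Mirror the reported tokenizer split: the period stays with what follows."""
    if is_number(token) or "." not in token:
        return [token]
    head, _, tail = token.partition(".")
    return [part for part in (head, "." + tail) if part]


def split_like_dictionary_builder(token):
    """Mirror the reported dictionary split: the period is its own token."""
    if is_number(token) or "." not in token:
        return [token]
    head, _, tail = token.partition(".")
    return [part for part in (head, ".", tail) if part]


def tokenize(text, splitter):
    """Split on whitespace, then apply the chosen period rule to each word."""
    return [piece for word in text.split() for piece in splitter(word)]


term = "22q11.2 deletion syndrome"
print(tokenize(term, split_like_ptb_tokenizer))
print(tokenize(term, split_like_dictionary_builder))
```

The two calls differ only in how the period is attached, which is the 
mismatch between the tokenizer output and the dictionary entry described 
above.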

The period-split difference above (period alone vs period + number) might be 
irrelevant here because for the input “22q11.2 deletion syndrome”, the lookup 
indices are [2,3].
The new lookup will ignore incoming tokens “22q11” because it’s CD and “.2” 
because it’s a number.

It looks like this concept cannot be identified unless CD is allowed as a 
lookup token POS.
Even if this is allowed though, in the case of gene locations I think the PTB 
rules might not be the best fit.
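
As a rough illustration of the filtering described above (a sketch under 
assumptions, not the lookup module's actual code), the function below keeps 
only token positions whose POS tag is not excluded and whose text is not a 
plain number. The name lookup_indices, the excluded_pos parameter, and the 
number pattern are all hypothetical.

```python
import re

# Hypothetical number test used only for this illustration.
_NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)")


def lookup_indices(tokens, pos_tags, excluded_pos=frozenset({"CD"})):
    """Return positions of tokens that would be used to start a lookup."""
    indices = []
    for index, (token, pos) in enumerate(zip(tokens, pos_tags)):
        if pos in excluded_pos:
            continue  # skipped because of its part-of-speech tag
        if _NUMBER.fullmatch(token):
            continue  # skipped because the token is a plain number
        indices.append(index)
    return indices


tokens = ["22q11", ".2", "deletion", "syndrome"]
pos_tags = ["CD", "CD", "NN", "NN"]

print(lookup_indices(tokens, pos_tags))                            # [2, 3]
print(lookup_indices(tokens, pos_tags, excluded_pos=frozenset()))  # [0, 2, 3]
```

With CD excluded, neither “22q11” nor “.2” is ever a lookup candidate. 
Allowing CD brings “22q11” back in this sketch, while “.2” is still dropped 
as a number.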

Are there any thoughts/experiences regarding addressing the gene location 
mentions like this?
Should the Fast Dict tokenization logic match the PTB tokenizer logic to 
produce the same components?

Let me know if I have misread any of these points. Since these items would 
likely cause large changes, I am looking to get some feedback before moving 
forward.

Cheers,

Britt









Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com



Re: periods and the interaction with PTB & Fast Dict Lookup.

2015-07-15 Thread britt fitch
Thanks Sean.

The other part of the concern is whether it’s reasonable/feasible to alter 
tokenization rules for things like gene locations. I can work around this in a 
few ways, but if there are other cases where this might come up it could be 
worth looking at a blanket change. Sadly I don’t have another example off the 
top of my head; maybe organism names? From a few queries for UMLS terms 
containing periods, the majority seem to be things you really would want to 
split on. Perhaps genes are just an edge case.

I was looking at gene locations overall, not any particular gene or disorder 
grouping. The term I mentioned was just meant to be an example.


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com






Re: The SegmentRegexAnnotator of Ytex

2015-07-15 Thread vijay garla
Can you make sure you did everything documented here:
https://cwiki.apache.org/confluence/display/CTAKES/YTEX+Installation
I can see from the stack trace that Hibernate is not on the classpath (see
the section 'Unzip YTEX Libraries').

Best,

VJ

On Tue, Jul 14, 2015 at 2:41 AM, Oranit Dror  wrote:

> Thank you, Vijay.
> However, I am still encountering the crash.
>
> Best,
> Oranit.
>
> -----Original Message-----
> From: vijay garla [mailto:vnga...@gmail.com]
> Sent: Monday, July 13, 2015 5:53 PM
> To: dev@ctakes.apache.org
> Subject: Re: The SegmentRegexAnnotator of Ytex
>
> see https://cwiki.apache.org/confluence/display/CTAKES/User%27s+Guide
>
> best,
>
> vj
>
> On Mon, Jul 13, 2015 at 2:50 AM, Oranit Dror  wrote:
>
> > Hello,
> >
> > I am using cTAKES 3.2.2 and recently tried to apply the YTEX
> > pipeline. In particular, I am interested in the SegmentRegexAnnotator of
> > YTEX.
> >
> > My questions are:
> >
> > 1. When running the pipeline, an
> > org.apache.uima.resource.ResourceInitializationException is thrown,
> > probably due to a failure in the initialization of
> > org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator. The stack
> > trace is below.
> >
> > 2. Where can I find information on how the SegmentRegexAnnotator
> > works, especially where the list of segments is defined?
> >
> > Thank you,
> > Oranit.
> >
> >
> > The stack trace for the Ytex pipeline crash:
> >
> > 12 Jul 2015 09:47:52 ERROR RunEngine - Failed to create AE from xml descriptor :E:/Data/Views/oranit_nlp/subprod1/nlp/java/algotec-nlp/desc/desc/algotec-nlp/desc/analysis_engine/AggregateDiseaseYtexUMLSProcessorDescriptor.xml
> > org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator" failed. (Descriptor: file:/E:/Program Files/apache-ctakes-3.2.2-rc2/desc/ctakes-ytex-uima/desc/analysis_engine/SegmentRegexAnnotator.xml)
> >     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
> >     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:156)
> >     at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
> >     at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
> >     at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
> >     at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
> >     at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
> >     at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
> >     at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
> >     at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
> >     at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
> >     at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
> >     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
> >     at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
> >     at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
> >     at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
> >     at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:354)
> >     at com.algotec.nlp.RunEngine.createCasObjects(RunEngine.java:1399)
> >     at com.algotec.nlp.RunEngine.ensureCasObjects(RunEngine.java:1373)
> >     at com.algotec.nlp.RunEngine.analyze(RunEngine.java:954)
> >     at com.algotec.nlp.servlet.ReportNLPServlet.doPost(ReportNLPServlet.java:128)
> >     at com.algotec.nlp.servlet.ReportNLPServlet.doPos

UmlsConcept subject

2015-07-15 Thread Tomasz Oliwa
Hi,

I think there is a regression in the way cTAKES discovers the subject status 
("patient", "family_member", etc.) of an UmlsConcept. Using cTAKES 3.2.2 and 
the AggregatePlaintextFastUMLSProcessor in the CVD:

1. "Patient's brother has a myocardial infarction." 
"myocardial infarction" and "infarction" have subject = "patient"

2. "Father had a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "patient"

3. "Sister was diagnosed with a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "patient"

4. "Family member had a myocardial infarction."
"myocardial infarction" and "infarction" have subject = "family_member" (this 
is correct)
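
For a ticket, the four cases above can be kept as a small checklist. The 
sketch below only tabulates the observations reported in this message and 
does not call cTAKES. The expected value "family_member" for cases 1 to 3 is 
an assumption inferred from case 4 being marked correct, and the names CASES 
and mismatches are hypothetical.

```python
# Observed subject values, copied from the four cases reported above.
CASES = [
    ("Patient's brother has a myocardial infarction.", "patient"),
    ("Father had a myocardial infarction.", "patient"),
    ("Sister was diagnosed with a myocardial infarction.", "patient"),
    ("Family member had a myocardial infarction.", "family_member"),
]

# Assumption: every sentence describes a relative, so "family_member" is expected.
EXPECTED_SUBJECT = "family_member"


def mismatches(cases, expected=EXPECTED_SUBJECT):
    """Return the (sentence, observed) pairs whose subject differs from expected."""
    return [(sentence, observed) for sentence, observed in cases if observed != expected]


for sentence, observed in mismatches(CASES):
    print(f"expected {EXPECTED_SUBJECT!r}, observed {observed!r}: {sentence}")
```

Re-running the same sentences after a fix and updating the observed column 
would show whether the list of mismatches becomes empty.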

I am looking at the code of the SubjectCleartkAnalysisEngine. Is this the class 
responsible for inferring the subject?
How can this be fixed? Should I open a JIRA ticket?

Thanks,
Tomasz

RE: UmlsConcept subject

2015-07-15 Thread Chen, Pei
Tomasz,
Yes, please feel free to open a Jira ticket for this. Also, be sure to 
include the version of cTAKES and the pipeline you're using.
It is possible that the new Subject Classifier isn't classifying this...


RE: UmlsConcept subject

2015-07-15 Thread Tomasz Oliwa
https://issues.apache.org/jira/browse/CTAKES-369 is open now.

Thanks for looking into this. If there is something I could additionally test, 
let me know.