Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator

2015-07-09 Thread Jakob Rogstadius
Hi cTakes devs,

I am trying to use cTakes' UMLS Dictionary Lookup annotator (either the older 
one or the newer fast one) through uimaFIT, and I am running into problems with 
resources that can't be found. Please bear with me if my problem description 
omits any relevant details, as I don't have much experience with cTakes, UIMA, 
Java, Maven, or Eclipse.

cTakes is imported into my Eclipse project through Maven, and I have a very 
basic pipeline running with a few annotators from UIMA and cTakes, along with a 
few custom ones. I have specified the UMLS login details in the arguments of 
the Eclipse runtime configuration, which works. However, when I add either a 
UmlsDictionaryLookupAnnotator or its fast version, they fail to resolve their 
respective lookup descriptor files. I have included a stack trace for the first 
method, while the second method throws a null pointer exception on 
AbstractJCasTermAnnotator.initialize() (line 129), due to the fileResource 
variable being null.

I have noticed that since cTakes version 3.1.1, the lookup descriptor file 
referenced in UmlsDictionaryLookupAnnotator.createAnnotatorDescription() is no 
longer included in ctakes-dictionary-lookup-res-3.x.x.jar. I don't know if the 
same change took place for the fast dictionary, but I can see that the xml file 
referenced in DictionaryLookupFactory.createUmlsDictionaryLookupDescription() 
is not present in the ctakes-dictionary-lookup-fast-res-3.2.2.jar that I get 
through Maven. Have these files moved, so that I now need to include something 
else? Am I doing something else wrong?

Also, I have downloaded the UMLS dictionary resources from 
http://ctakes.apache.org/downloads.cgi, but where do I place them for cTakes to 
be able to find them?

Stack trace for UmlsDictionaryLookupAnnotator.createAnnotatorDescription():

java.io.FileNotFoundException: No File exists at org/apache/ctakes/dictionary/lookup/LookupDesc_Db.xml
    at org.apache.ctakes.core.resource.FileLocator.getFullPath(FileLocator.java:162)
    at org.apache.ctakes.core.resource.FileLocator.locateFile(FileLocator.java:70)
    at org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator.createAnnotatorDescription(UmlsDictionaryLookupAnnotator.java:118)
    at org.umc.research.social_media_adr_detection.pipelines.ExtractDrugAndAEMentions.main(ExtractDrugAndAEMentions.java:128)
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException
    at org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator.createAnnotatorDescription(UmlsDictionaryLookupAnnotator.java:156)
    at org.umc.research.social_media_adr_detection.pipelines.ExtractDrugAndAEMentions.main(ExtractDrugAndAEMentions.java:128)
Caused by: java.io.FileNotFoundException: No File exists at org/apache/ctakes/dictionary/lookup/LookupDesc_Db.xml
    at org.apache.ctakes.core.resource.FileLocator.getFullPath(FileLocator.java:162)
    at org.apache.ctakes.core.resource.FileLocator.locateFile(FileLocator.java:70)
    at org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator.createAnnotatorDescription(UmlsDictionaryLookupAnnotator.java:118)
    ... 1 more

Any pointers would be greatly appreciated.

Best regards,

Jakob Rogstadius
Research Engineer

Uppsala Monitoring Centre
WHO Collaborating Centre for International Drug Monitoring


Re: dictionary-look-fast fails to handle alternative CUIs

2015-07-09 Thread britt fitch
I don’t think that is too much of a constraint, at least initially, to have all 
CUI values a consistent length for a given prefix.

Thanks Sean, let me know if there is any part of this you’d like a hand with.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 8, 2015, at 7:16 PM, Finan, Sean wrote:
> 
> Hi Britt,
> 
> You’ve got it exactly.
> 
> I actually started working on this right before a meeting right before I left 
> work right before I went to the store … but I’m now back to it and I’m going 
> to move forward with the tiny bit that I’ve got.  I don’t think that it will 
> take too long …
> 
> One reason that I like the “pair” idea is that something like “CN123456” 
> won’t get converted to “CN0123456” by assuming that it is a seven digit 
> numerical base. Likewise somebody could make a tiny dictionary with “SEAN01, 
> SEAN02, SEAN03…” through 99.  Then their output would still be formatted as 
> “SEAN01 .. SEAN99”.  They couldn’t mix in “SEAN1, SEAN2 …” though.  Is that 
> too much of a constraint?  Hmmm.  Well, I’m going to push forward with this 
> idea.
> 
> I’ll check in whatever I get done tonight.
> 
> Cheers,
> Sean
> 
> 
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 4:21 PM
> To: dev@ctakes.apache.org 
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
> 
> Thanks for the details Sean. I had assumed the conversion to Long was related 
> to sort/search efficiency but that makes sense.
> 
> I had been thinking of something similar with parsing out the non-numerals 
> and converting them to 2-digit numeral values, e.g. “a” = 01, “z” = 26. 
> Ultimately CN123456 would become 0314123456, but I don’t think it’s 
> sophisticated enough to avoid issues with leading zeros. We could prepend a 9 
> to it to avoid losing digits and use something like:
> 
> if (length > 8 && begins with 9)
>     discard 9
>     while (length > 8)
>         convert first 2 numbers to a letter
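
The letter-to-digits idea above can be sketched in Java as follows. This is only an illustration under stated assumptions: every CUI is exactly 8 characters, made of letters A-Z followed by digits, and the class and method names are hypothetical (they are not part of cTAKES).

```java
import java.util.Locale;

class LetterDigitCuiSketch {
    // Assumption: every CUI has exactly this many characters.
    private static final int CUI_LENGTH = 8;

    // Encode letters as 2-digit values (A = 01 ... Z = 26), keep digits,
    // and prepend a 9 so that leading zeros survive the conversion to long.
    static long encode(String cui) {
        StringBuilder digits = new StringBuilder("9");
        for (char c : cui.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(String.format("%02d", c - 'A' + 1));
            } else {
                throw new IllegalArgumentException("Unsupported character in " + cui);
            }
        }
        return Long.parseLong(digits.toString());
    }

    // Reverse the encoding: drop the leading 9, then turn 2-digit groups back
    // into letters until the rebuilt CUI is CUI_LENGTH characters long.
    static String decode(long code) {
        String text = Long.toString(code);
        if (text.length() <= CUI_LENGTH || text.charAt(0) != '9') {
            throw new IllegalArgumentException("Not an encoded CUI: " + code);
        }
        String rest = text.substring(1);
        StringBuilder prefix = new StringBuilder();
        while (prefix.length() + rest.length() > CUI_LENGTH && rest.length() >= 2) {
            int letterIndex = Integer.parseInt(rest.substring(0, 2));
            prefix.append((char) ('A' + letterIndex - 1));
            rest = rest.substring(2);
        }
        return prefix + rest;
    }

    public static void main(String[] args) {
        long code = encode("CN123456");
        System.out.println(code + " -> " + decode(code));
    }
}
```

Because decoding relies on the fixed total length, this sketch cannot represent codes of varying length, which is one reason the prefix-based approach may be preferable.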
> 
> I think your suggestion sounds good to me. To run the example through:
> 
> “NLM300” gets parsed to “NLM” + “300”
> Store Pair(3, NLM) at Pair[0]
> Produce a Long of 0 * 1000 + 300 = 300L
> Backtrack to the actual “CUI”: floor(300/1000) = 0L
> 300L - 0L = 300L
> Pair[0] = NLM
> CUI = NLM + 300
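
The worked example above can be sketched in Java roughly as follows. This is a hedged illustration, not the code that was later checked in: the class name, the configurable multiplier, and the choice to reject mixed digit widths for one prefix are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixedCuiCodecSketch {
    // The "Pair" from the thread: a prefix plus the digit count shared by
    // every code that uses this prefix.
    private static final class PrefixWidth {
        final String prefix;
        final int digits;

        PrefixWidth(String prefix, int digits) {
            this.prefix = prefix;
            this.digits = digits;
        }
    }

    private final List<PrefixWidth> pairs = new ArrayList<>();
    private final long multiplier; // assumption: 1000 in the worked example

    PrefixedCuiCodecSketch(long multiplier) {
        this.multiplier = multiplier;
    }

    // Split into prefix and number, then pack as index * multiplier + number.
    long encode(String cui) {
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        String prefix = cui.substring(0, split);
        String number = cui.substring(split);
        long value = Long.parseLong(number);
        if (value >= multiplier) {
            throw new IllegalArgumentException("Numeric part too large for multiplier: " + cui);
        }
        return indexOf(prefix, number.length()) * multiplier + value;
    }

    // Find the index of a known prefix, or register a new one.
    private int indexOf(String prefix, int digits) {
        for (int i = 0; i < pairs.size(); i++) {
            PrefixWidth pair = pairs.get(i);
            if (pair.prefix.equals(prefix)) {
                if (pair.digits != digits) {
                    throw new IllegalArgumentException("Mixed digit widths for prefix " + prefix);
                }
                return i;
            }
        }
        pairs.add(new PrefixWidth(prefix, digits));
        return pairs.size() - 1;
    }

    // Recover the index with integer division, then re-pad the number.
    String decode(long code) {
        int index = (int) (code / multiplier);
        if (index < 0 || index >= pairs.size()) {
            throw new IllegalArgumentException("Unknown prefix index in code " + code);
        }
        long value = code - index * multiplier;
        PrefixWidth pair = pairs.get(index);
        return pair.prefix + String.format("%0" + pair.digits + "d", value);
    }

    public static void main(String[] args) {
        PrefixedCuiCodecSketch codec = new PrefixedCuiCodecSketch(1000L);
        long code = codec.encode("NLM300");
        System.out.println(code + " -> " + codec.decode(code));
    }
}
```

Storing the digit count next to the prefix is what lets the decoder restore leading zeros, which a plain String[] of prefixes would not do on its own.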
> 
> In that case, do we need to store it as a Pair at all or is just storing the 
> prefix in a String[] sufficient?
> 
> I’m happy to start working on this unless you have a preference for splitting 
> it out into multiple tasks.
> 
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> britt.fi...@wiredinformatics.com
> 
> On Jul 8, 2015, at 2:54 PM, Finan, Sean wrote:
> 
> By the way, in case you are wondering why it does this … the umls database 
> that we use has roughly half a million cuis.  Storing cuis in the various 
> tables as longs takes up a lot less space than storing them as 8 character 
> strings.
> 
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: dev@ctakes.apache.org
> Subject: dictionary-look-fast fails to handle alternative CUIs
> 
> This is largely directed to Sean but open to other feedback as well.
> 
> The current fast lookup using a BSV parses the first field as “C” and up to 7 
> numerals, padding with “0” as needed to reach that length when applicable 
> [see CuiCodeUtil.getCuiCode(String)].
> 
> The CUI string is then substring’d from 1 to len and parsed as a Long.
> 
> This is producing issues with other related, but separate, ontologies 
> (MedGen) where the bulk of concepts use UMLS CUIs but some additional 
> concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6 numerals, 
> so the substring “N123456” fails to parse as a Long.
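
The failure described above can be reproduced with a few lines of plain Java. The class below is a hypothetical stand-in for the behaviour described in this message, not the actual CuiCodeUtil source.

```java
class LegacyCuiParseSketch {
    // Skip the first character (assumed to be 'C') and parse the rest as a long.
    static long toCode(String cui) {
        return Long.parseLong(cui.substring(1));
    }

    // Rebuild the text form by zero-padding to seven digits after the 'C'.
    static String toCui(long code) {
        return String.format("C%07d", code);
    }

    public static void main(String[] args) {
        System.out.println(toCode("C0123456"));
        System.out.println(toCui(123456L));
        try {
            // After dropping the 'C', the remaining "N123456" is not numeric.
            toCode("CN123456");
        } catch (NumberFormatException e) {
            System.out.println("Cannot parse: " + e.getMessage());
        }
    }
}
```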
> 
> I wanted Sean’s thoughts on this and to get some feedback on whether others 
> are running into this issue and whether the community wants a solution that 
> supports a CUI format beyond the standard C + 7 numerals.
> 
> I’m happy to make these edits and check them in, whether that means updating 
> the CuiCodeUtil class or creating an entirely new BSVConceptFactory if that’s 
> what makes the most sense.
> 
> Thoughts?
> 
> Britt Fitch
> Wired Informatics
> 265 Franklin

RE: Cannot resolve lookup descriptor files for UmlsDictionaryLookupAnnotator

2015-07-09 Thread Finan, Sean
Hi Jakob,

Where those files exist really depends upon how you are trying to run.  They 
start in src/main/resources/ directories in their respective -res projects.

If you are running from an IDE, make sure that the -res modules have been added 
to your project, and that the src/main/resources/ directories have been tagged 
as resource directories.  I'm not an Eclipse expert either (I don't use it), 
but if it imports via maven it should be doing that automatically, or maybe 
tagging them as source directories.

If you are running from a full build of the application, there should be a 
resources/ directory in your root.  If that directory exists and contains the 
.xml files, either run the app from the root directory or set $CTAKES_HOME to 
point to that root.
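
To check which of these locations actually holds a descriptor, a small diagnostic like the one below may help. It is a sketch only: it mirrors the places mentioned in this message (classpath, a resources/ directory under the working directory, a resources/ directory under CTAKES_HOME), it assumes CTAKES_HOME is an environment variable, and it does not claim to reproduce the search order used by cTAKES' FileLocator. The class name is hypothetical.

```java
import java.io.File;
import java.net.URL;

class ResourceProbeSketch {
    // Returns a description of where the resource was found, or null if it
    // was found in none of the probed locations.
    static String locate(String relativePath) {
        // 1. On the classpath, e.g. inside a -res jar or a resources folder.
        URL url = ResourceProbeSketch.class.getClassLoader().getResource(relativePath);
        if (url != null) {
            return "classpath: " + url;
        }
        // 2. Under resources/ relative to the current working directory.
        File local = new File("resources", relativePath);
        if (local.isFile()) {
            return "working directory: " + local.getAbsolutePath();
        }
        // 3. Under resources/ relative to CTAKES_HOME, if that is set.
        String home = System.getenv("CTAKES_HOME");
        if (home != null) {
            File underHome = new File(new File(home, "resources"), relativePath);
            if (underHome.isFile()) {
                return "CTAKES_HOME: " + underHome.getAbsolutePath();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = "org/apache/ctakes/dictionary/lookup/LookupDesc_Db.xml";
        String found = locate(path);
        System.out.println(found == null ? "Not found: " + path : found);
    }
}
```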

If you run with today's build you should see a listing of your classpath upon 
that error - which may or may not help you find the problem.
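
On older builds a similar listing can be produced by hand. The sketch below uses only the standard java.class.path system property; the class name is hypothetical, and the fragment "ctakes-dictionary-lookup" is taken from the jar names mentioned in the original question.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ClasspathListingSketch {
    // Collect the classpath entries whose path contains the given fragment.
    static List<String> entriesContaining(String fragment) {
        String classpath = System.getProperty("java.class.path", "");
        List<String> hits = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty() && entry.contains(fragment)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // An empty result suggests the -res artifacts are not on the classpath.
        for (String entry : entriesContaining("ctakes-dictionary-lookup")) {
            System.out.println(entry);
        }
    }
}
```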

Sean



RE: dictionary-look-fast fails to handle alternative CUIs

2015-07-09 Thread Finan, Sean
Hi Britt,

I’ve got some code and tests to check in.  Would you like to write the jira 
item?


Re: dictionary-look-fast fails to handle alternative CUIs

2015-07-09 Thread britt fitch
Absolutely. I’ll create it now.

Thanks!



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com


Re: dictionary-look-fast fails to handle alternative CUIs

2015-07-09 Thread britt fitch
Linking ticket here for completeness 
https://issues.apache.org/jira/browse/CTAKES-368 



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com



RE: dictionary-look-fast fails to handle alternative CUIs

2015-07-09 Thread Finan, Sean
Checked in. Please give it a test and close the ticket if it fits your purposes.
