Re: upcoming ctakes-temporal bundled models

2014-06-30 Thread Miller, Timothy
Do you just mean a brief high-level explanation of how they work?
Tim

On 06/24/2014 04:04 PM, Pei Chen wrote:
> Does anyone happen to have a quick/simple README about the current best
> performing models that are being included?
> BackwardsTime
> Event
> DocTimeRel
> ContextualModality
>



Bacterium Dictionary

2014-06-30 Thread Nick Nikandish
Hi there,

I was wondering if cTAKES has a bacterium dictionary. I need to extract
information for bacteria like "Enterococcus Faecium", "Pseudomonas Aeruginosa",
etc., and I was wondering whether I can do it using cTAKES annotators.

Thanks,

Nick Nikandish
Product Development Software Engineer
Clinical Research Informatics

Emerging Health
Montefiore Information Technology
6 Executive Blvd. Suite 290, Yonkers, NY 10701
914-457-6792 Office
snika...@montefiore.org
www.emerginghealthit.com
www.montefiore.org




Re: [VOTE] Release Apache cTAKES 3.2.0

2014-06-30 Thread andy mcmurry
The LVG is particularly important: users are most likely to compare cTAKES
and MetaMap (or other) performance on the NER task, and the LVG clearly
helps. Requiring a new user to understand LVG is the hard part. The
download isn't a tough requirement, but I wonder how many new users would
even know to do that, even with clear instructions.


On Sat, Jun 28, 2014 at 7:25 PM, Masanz, James J. 
wrote:

> The release notes include some JIRA issues that are open (and I think some
> that have not had any changes done for them)
>
> Example of one that has not been implemented as far as I know:
> https://issues.apache.org/jira/i#browse/CTAKES-122
>
> Example of one that has status=open and Resolution=Unresolved
> https://issues.apache.org/jira/i#browse/CTAKES-224
>
> There are others
>
> -- James
>
> -Original Message-
> From: Pei Chen [mailto:chen...@apache.org]
> Sent: Friday, June 27, 2014 5:16 PM
> To: dev@ctakes.apache.org
> Subject: [VOTE] Release Apache cTAKES 3.2.0
>
> Hi all,
>
> This is a call for a vote on releasing the following candidate (rc1) as
> Apache cTAKES 3.2.0.
> The major changes include:
> - New optional YTEX component(s) (Yale Extensions to cTAKES)
> - New optional improved/faster dictionary lookup (dictionary-lookup-fast)
> - New optional Temporal component (Time + Event extraction.  Relations will
> be included in a future release.)
> - Other bug fixes/enhancements from Jira
>
> [TODO: Online documentation still needs to be updated on wiki for the abo]
>
> For more detailed information on the changes/release notes, please visit:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066
>
> The release was made using the cTAKES release process documented here:
> http://ctakes.apache.org/ctakes-release-guide.html
>
> The candidate is available at:
>
> http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz
> /.zip
>
> The tag to be voted on:
> http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc1/
>
> The MD5 checksum of the tarball can be found at:
>
> http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.md5
> /.zip.md5
>
> The signature of the tarball can be found at:
>
> http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.asc
> /.zip.asc
>
> Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
> https://dist.apache.org/repos/dist/release/ctakes/KEYS
>
> Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is
> open for at least the next 72 hours.
> Only votes from the cTAKES PMC are binding, but folks are welcome to check
> the release candidate and voice their approval or disapproval.
> The vote passes if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache cTAKES 3.2.0
> [ ] -1 Do not release the packages because...
>
> Also, the convenience binary can be found at:
>
> http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-bin.tar.gz
> /.zip
> Note: It's temporarily on people.a.o because the artifacts were too large
> for https://dist.apache.org/repos/dist/dev/ctakes (Working with infra on
> increasing the limit).
>
>
> Thanks!
>


RE: upcoming ctakes-temporal bundled models

2014-06-30 Thread Chen, Pei
I think it would be nice to have both?
- some info about the models - i.e. the training set used, etc.
- as well as high level explanation?




Re: Bacterium Dictionary

2014-06-30 Thread Pei Chen
Nick,
I am not sure how complete it is, but I believe the UMLS has the semantic
type Bacterium [T007].
It's most likely not included in the default cTAKES dictionaries, though...

Thanks,
Pei




RE: [VOTE] Release Apache cTAKES 3.2.0

2014-06-30 Thread Chen, Pei
Thanks James.
I just did a Jira review for 3.2.  There are just 2 remaining items that are 
pending some clarification from the respective devs.  Otherwise, it should be up to 
date now; any items that didn't make it into 3.2 have been updated to 3.2.1 
instead.
--Pei



RE: [VOTE] Release Apache cTAKES 3.2.0

2014-06-30 Thread Masanz, James J.
Thanks!  I haven't downloaded and reviewed the packages themselves yet but I do 
plan to at least start on that today.

-- James



RE: Bacterium Dictionary

2014-06-30 Thread Finan, Sean
Hi Nick,
There are ~26,000 T007 Bacterium (falls under Living Being) entries in UMLS 
2013AA.  They aren't in the cTAKES dictionary, but you can build a separate 
bacteria dictionary using the dictionary creator tool in the cTAKES sandbox.  It 
can create dictionaries formatted for use with both available 
ctakes-dictionary-lookup modules.  I have a full living beings dictionary; if 
you can confirm your UMLS license, then I could pull out the 
bacteria for you.
Sean



RE: Bacterium Dictionary

2014-06-30 Thread Nick Nikandish
Hi Sean,

Thanks for the info. I have written an application for clinical text using 
cTAKES in which one of the annotators retrieves and identifies bacteria in 
clinical texts, but it uses a small library that I created. I would therefore 
like to check those texts against a comprehensive library like UMLS. I have a UMLS 
account, but I was wondering how to get cTAKES to use that library. It 
would be great if there were some documentation on building a separate dictionary 
using the dictionary creator.


Thanks again,
Nick



RE: Bacterium Dictionary

2014-06-30 Thread Finan, Sean
Hi Nick,

I'm pasting (below) from a howto.txt  in the dictionarytool/doc/ directory.

You will want to do the following:
1.  Download / Install the UMLS dictionary source from NIH  (takes a while)
2.  Create a file named "bacterium.tui" containing the single line "T007"
3.  Decide what ctakes dictionary module to use, the default or newer.  Using 
the default may be faster for you.
4.  Build the dictionary creator tool.  I can send a prebuilt jar if you have 
problems.
5.  java -cp DictionaryTool.jar 
org.apache.ctakes.dictionarytool.DictionaryCreator -fw -umls pathToUmlsRoot 
-tui pathToBacterium.tui -ol sanityCheck.bsv
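
Steps 2 and 5 above can be sketched in Python as follows. Only the flags quoted in this message (-fw, -umls, -tui, -ol) are used. The helper names and any paths passed in are placeholders, and the command is only assembled here, not executed.

```python
from pathlib import Path


def write_tui_file(path, tuis=("T007",)):
    """Write one TUI per line; T007 is the Bacterium semantic type."""
    target = Path(path)
    target.write_text("".join(f"{tui}\n" for tui in tuis), encoding="ascii")
    return target


def flat_file_command(umls_root, tui_path, output_path, jar="DictionaryTool.jar"):
    """Assemble the step 5 command line as an argument list (hypothetical helper)."""
    return [
        "java", "-cp", str(jar),
        "org.apache.ctakes.dictionarytool.DictionaryCreator",
        "-fw",
        "-umls", str(umls_root),
        "-tui", str(tui_path),
        "-ol", str(output_path),
    ]
```

Once the jar is built, the returned list could be handed to subprocess.run(cmd, check=True).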

After running #5 with the path to your UMLS installation and the file containing "T007", 
you should have a bar-separated-value file named sanityCheck.bsv containing all 
the bacteria entry CUIs and Text.  If it looks good, then you can use it 
directly or create an hsql database:
1.  Copy resource/cacheddbtemplate/* to yourDbLocation/  You can also use the 
memdbtemplate.
2.  Rename the * template files to suit your need (nick_bacteria)
3.  java -cp DictionaryTool.jar 
org.apache.ctakes.dictionarytool.DictionaryCreator -fw -umls pathToUmlsRoot 
-tui pathToBacterium.tui -db nick_bacteria -tbl nick_bacteria
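
Before building the database, the flat file can be inspected with a short script. The sketch below assumes each line has the form CUI|Text, as in the tool's own description of its flat-file output. The function name and the shape of the returned summary are invented for illustration.

```python
from collections import Counter


def summarize_bsv(lines):
    """Summarize bar-separated CUI|Text lines.

    Returns (entry_count, distinct_cui_count, malformed_lines).
    A line counts as malformed when it has no bar or an empty field.
    """
    cuis = Counter()
    malformed = []
    entries = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split at the first bar only; any later bars stay in the text field.
        cui, sep, text = line.partition("|")
        if not sep or not cui.strip() or not text.strip():
            malformed.append(line)
            continue
        cuis[cui.strip()] += 1
        entries += 1
    return entries, len(cuis), malformed
```

For example: with open("sanityCheck.bsv", encoding="utf-8") as f: print(summarize_bsv(f))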

Gotta run,
Sean



>java -cp DictionaryTool.jar org.apache.ctakes.dictionarytool.DictionaryCreator

Dictionary Creator: Creates a flat file Cui|Text or Database Dictionary from 
UMLS and Orangebook
Database Dictionary can be indexed by each Text's First Word or Rarest Word 
(for the dictionary)
Minimal Usage: DictionaryCreator -umls pathToUmlsRoot -ol pathToFlatFileOutput

-fw     Create First Word Index
-umls   Umls Root Directory
-ob     Orangebook Path
-fd     Format Data Directory
-tui    Input Tui List Path
-src    Source Type List Path
-ol     Output Cui and Term List Path
-db     Output Database Url
-tbl    Output Database Table

The UMLS Root Directory must be specified
One form of output must be specified using either -ol or -db and -tbl
The default index type for databases is Rare Word Index
If an Orangebook Path is not specified then (orangebook) medication terms are 
not written
If a Format Data Directory is not specified then the default is used: 
./data/default
If an Input Tui List Path is not specified then the cTakes Tuis are used: 
./data/default/CtakesAllTuis.txt
If a Source Type List Path is not specified then Snomed is used: 
./data/default/CtakesSources.txt

Important: Dictionary entries are appended to the output file or database.  
Running the same command twice will result in a database with all terms 
existing twice.
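
Because entries are appended, a flat file that was accidentally written twice can be cleaned up afterwards. A minimal sketch, assuming exact duplicate lines are the only problem and that the original order should be kept; it works on the flat file only, not on an hsqldb database, and the function name is hypothetical:

```python
def drop_duplicate_lines(lines):
    """Return lines with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.rstrip("\n")
        if key in seen:
            continue
        seen.add(key)
        unique.append(key)
    return unique
```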

The data/default/ directory does include non-default possibilities, such as 
files listing only single cTakes groups:
e.g. CtakesAnatTuis.txt
and all UMLS groups:
UmlsAllTuis.txt
that can be used with the option -tui ./data/default/UmlsAllTuis.txt

There is also a file with all UMLS sources:
UmlsAllSources.txt
that can be used with the option -src ./data/default/UmlsAllSources.txt

Remember that if you want to output to a database you must specify both the url 
and table name:
-db jdbc:hsqldb:file:pathToMyDatabase -tbl myTableName

Also remember that hsqldb requires the entire url to be lowercase.
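
The two database requirements above (both -db and -tbl, and an all-lowercase url) are easy to check before launching the tool. A sketch, where the function name is hypothetical and the jdbc:hsqldb:file: prefix is taken from the example in the help text:

```python
def database_args(db_path, table_name):
    """Build the -db/-tbl arguments, rejecting urls that are not all lowercase."""
    if not table_name:
        raise ValueError("a table name is required together with -db")
    url = f"jdbc:hsqldb:file:{db_path}"
    if url != url.lower():
        raise ValueError(f"hsqldb url must be lowercase: {url}")
    return ["-db", url, "-tbl", table_name]
```

The returned list can be appended to the rest of the DictionaryCreator command line.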

"Format Data" refers to the data that is used to format the end-result 
dictionary by trimming or expanding the umls entries.
It is recommended that the defaults are used, but you are welcome to experiment 
with your own.


If you are unfamiliar with hsqldb, there are two template / starting point 
databases in the resource/ directory.
cacheddbtemplate/ contains a template for a disk-cached dictionary, and 
memdbtemplate one for a fully in-memory dictionary.
Using an in-memory dictionary is orders of magnitude faster than using a 
disk-cached one, but it is not a good idea for very large (.5GB?) databases.


There are a few other toys that can be found by perusing the source, such as a 
tool that creates a mapping of codes 
for like terms in different dictionaries:
ICD10|ICD9|RXNORM|SNOMEDCT
Usage: java -cp DictionaryTool.jar 
org.apache.ctakes.dictionarytool.CodeMapCreator -umls pathToUmlsRoot -ol 
pathToFlatFileOutput

Some of these extra utilities may be experimental or unfinished, so user beware.



At this time the code could use some javadocs and unit tests, plus a little 
cleanup.  I'm very busy, so volunteer work is appreciated.

Enjoy



RE: Bacterium Dictionary

2014-06-30 Thread Nick Nikandish
Many thanks, Sean. This is very useful. I will follow the instructions and create 
the dictionary.

Thanks,
Nick


RE: [VOTE] Release Apache cTAKES 3.2.0

2014-06-30 Thread Masanz, James J.

This is pretty obvious, but since this is a record of what was voted upon, 
note that some of the URLs contain an extra

ctakes-3.2.0/

For example 
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

should be just 
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

-- James

