Re: Dependency Parser model data

2015-03-24 Thread Ephi
Thanks!

* 1 **
Regarding the documentation - the documentation for cTAKES 3.2 [1] links to
the Dependency Parser documentation for 3.0 [2]; there doesn't seem to be
updated documentation for this component.

In the page from 3.0 it says simply that clinques.mod is the main
ClearParser model packaged with cTAKES v1.1 and that it is trained on a
corpus of 1600 clinical questions.

* 2 **
Regarding self training of the models - I tried following the documentation
but didn't succeed. The documentation [2] states the following:

1. Download and install the C++ version of liblinear from National Taiwan
University; this requires much less memory than the default Java version.
2. Train a model
To create a model using cTAKES POS tags and lemmas with Eclipse:
1. Create a .min file from .dep (see the section
called "Conversion between formats")
2. Use the UIMA_CPE_GUI---dependency parser launch.
3. Load desc/collection_processing_engine/ClearTrainerPosLemTestCPE.xml
4. Put your filename under "Dependency File"
5. Make sure "Training Mode" is checked
6. Rename the "Dependency Model File" and "Lexicon Directory" according to
what you want.
7. Make sure "Trainer Path" is a valid relative path from
<cTAKES_HOME>/dependency parser to a valid liblinear binary train file.
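
For step 7, one way to sanity-check the "Trainer Path" value before launching the CPE is a few lines of Python. This is only a sketch: the function name is made up, and whatever base directory and relative path you pass in are your own values, not paths taken from the cTAKES distribution.

```python
import os


def resolve_trainer(base_dir, trainer_path):
    """Resolve trainer_path relative to base_dir.

    Returns (full_path, ok), where ok is True only if the resolved path
    is an existing file that the current user may execute.
    """
    full = os.path.normpath(os.path.join(base_dir, trainer_path))
    ok = os.path.isfile(full) and os.access(full, os.X_OK)
    return full, ok
```

Calling resolve_trainer(base_dir, relative_path) with the dependency parser directory and the value from the CPE field shows where the relative path actually lands and whether a runnable file is there.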


Regarding step 2 - cTAKES 3.2 doesn't seem to have the UIMA_CPE_GUI, there
is only bin/runCPE.bat. I tried running this.

Regarding step 3 -
When I tried to load
desc\ctakes-dependency-parser\desc\collection_processing_engine\ClearTrainerPosLemTestCPE.xml
I got an error (snapshot attached)

Any ideas?

Thanks, Ephi

[1]
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Component+Use+Guide
[2]
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+Dependency+Parser+and+Semantic+Role+Labeler

On Mon, Mar 16, 2015 at 6:46 AM, Pei Chen  wrote:

> Ephi,
> The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should
> contain much more.  They should contain at least MiPACQ and SHARP training
> data.  Could you point us to the documentation so we can update it?  I
> believe the break down was:
>
>
> - Clinical questions: 1,600 sentences, 30,138 tokens.
> - Medpedia articles: 2,796 sentences, 49,922 tokens.
> - MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
> - MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
> - Seattle group health clinical notes: 5,020 sentences, 61,124 tokens.
> - Seattle group health pathological notes: 2,294 sentences, 34,384 tokens.
> - SHARP clinical notes: 6,787 sentences, 94,205 tokens.
> - SHARP stratified: 4,316 sentences, 43,037 tokens.
> - SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
> - TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
> - TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.
>
> There are some discussions on appending/augmenting the existing
> annotated/training data[2].  I think the short answer is that there is
> currently no easy way short of having to sign DUA's from every single
> source institution.
>
> [1] http://svn.apache.org/r1465043
> [2]
>
> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3ce5a9fa5abbf1ca4085d4f0794852a51e24241...@chexmbx3a.chboston.org%3E
>
>
> On Sun, Mar 15, 2015 at 11:58 AM, Ephi  wrote:
>
> > Hi -
> >
> > From the documentation, the data used to train the dep parser in cTAKES
> > seems to be 1600 clinical questions (from the Mayo clinic?).
> >
> > Is there a way to retrieve this data in order to retrain the model (while
> > adding on additional data) ?
> >
> > Thanks!
> > Ephi
> >
>


Re: Dependency Parser model data

2015-03-24 Thread Ephi
Update:

I tried loading the descriptor file via the menu File -> Open CPE Descriptor

This throws the exception below because the file
ClearTrainerPosLemAggregate.xml is missing. From searching the internet, it
seems this file was included in cTAKES 2.0 but does not exist in the latest
cTAKES.

C:\apache-ctakes-3.2.1>java -cp
"C:\apache-ctakes-3.2.1/desc/;C:\apache-ctakes-3.2.1/resources/;C:\apache-ctakes-3.2.1/lib/*"
-Dlog4j.configuration=file:/C:\apache-ctakes-3.2.1/config/log4j.xml
-Xms512M -Xmx3g org.apache.uima.tools.cpm.CpmFrame
Error loading CPE Descriptor
C:\apache-ctakes-3.2.1\desc\ctakes-dependency-parser\desc\collection_processing_engine\ClearTrainerPosLemTestCPE.xml
java.io.FileNotFoundException:
C:\apache-ctakes-3.2.1\desc\ctakes-dependency-parser\desc\analysis_engine\ClearTrainerPosLemAggregate.xml
(The system cannot find the file specified)

    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileInputStream.<init>(Unknown Source)
    at sun.net.www.protocol.file.FileURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(Unknown Source)
    at org.apache.uima.util.XMLInputSource.<init>(XMLInputSource.java:120)
    at org.apache.uima.tools.cpm.CpmPanel.openCpeDescriptor(CpmPanel.java:1789)
    at org.apache.uima.tools.cpm.CpmPanel.readPreferences(CpmPanel.java:538)
    at org.apache.uima.tools.cpm.CpmPanel.<init>(CpmPanel.java:419)
    at org.apache.uima.tools.cpm.CpmFrame.<init>(CpmFrame.java:94)
    at org.apache.uima.tools.cpm.CpmFrame.initGUI(CpmFrame.java:178)
    at org.apache.uima.tools.cpm.CpmFrame.access$000(CpmFrame.java:49)
    at org.apache.uima.tools.cpm.CpmFrame$1.run(CpmFrame.java:168)
    at java.awt.event.InvocationEvent.dispatch(Unknown Source)
    at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
    at java.awt.EventQueue.access$400(Unknown Source)
    at java.awt.EventQueue$3.run(Unknown Source)
    at java.awt.EventQueue$3.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)
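
Since the GUI stops at the first missing file, a small script can list every descriptor reference that does not resolve. The following is a rough sketch and rests on an assumption I have not verified against ClearTrainerPosLemTestCPE.xml itself: that file references appear as `import` elements with a `location` attribute or `include` elements with an `href` attribute, relative to the descriptor's own directory. The function name is made up, and absolute paths or variable placeholders are not handled.

```python
import os
import xml.etree.ElementTree as ET


def missing_imports(descriptor_path):
    """Return resolved paths of referenced files that do not exist on disk.

    Assumes references are <import location="..."/> or <include href="..."/>
    and are relative to the directory containing the descriptor.
    """
    base = os.path.dirname(os.path.abspath(descriptor_path))
    missing = []
    for elem in ET.parse(descriptor_path).iter():
        tag = elem.tag.split("}")[-1]  # drop the XML namespace, if any
        if tag == "import":
            ref = elem.get("location")
        elif tag == "include":
            ref = elem.get("href")
        else:
            ref = None
        if ref:
            target = os.path.normpath(os.path.join(base, ref))
            if not os.path.isfile(target):
                missing.append(target)
    return missing
```

Running this on the CPE descriptor, and then on each descriptor it references, would show whether ClearTrainerPosLemAggregate.xml is the only file missing or just the first one.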



Re: Medical de-identification

2015-03-24 Thread britt fitch
Regarding UIMA knowledge, I think it's helpful to run through the UIMA tutorial
to get a feel for how pipelines are executed and to get familiar with the process
of building up annotations at each step and then doing something with the final
result. Running through the tutorial will familiarize you with the different
aspects of a pipeline (reader, annotator, consumer), how they are defined
(collection processing engines, annotator descriptors), the objects they use
internally, etc.
Last I saw, the tutorial was pretty quick and walked you through things like
identifying room numbers and a person's title in documents.
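
To make the reader / annotator / consumer split concrete before opening the tutorial, here is a toy version in plain Python. It is only an analogy and uses none of the UIMA API: the dictionary document, the room-number pattern, and all function names are invented for illustration.

```python
import re

# Made-up pattern for this illustration: two digits, a dash, three digits.
ROOM = re.compile(r"\b\d{2}-\d{3}\b")


def reader(texts):
    """Reader: wrap each raw text in a document with an empty annotation list."""
    for text in texts:
        yield {"text": text, "annotations": []}


def room_annotator(doc):
    """Annotator: add one (begin, end, type) annotation per room-number match."""
    for m in ROOM.finditer(doc["text"]):
        doc["annotations"].append((m.start(), m.end(), "RoomNumber"))
    return doc


def consumer(doc):
    """Consumer: do something with the final result, here return the covered text."""
    return [doc["text"][b:e] for b, e, _ in doc["annotations"]]


def run_pipeline(texts):
    """Run reader -> annotator -> consumer over every input text."""
    return [consumer(room_annotator(doc)) for doc in reader(texts)]
```

In UIMA the same three roles are separate components wired together by descriptors, and the shared document object is the CAS rather than a dictionary.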



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Mar 24, 2015, at 12:09 AM, Rohit Shinde wrote:
> 
> Thanks Britt! I am downloading the source code now and I will install it 
> soon. Right now, I have my mid semester exams for three days, I will come 
> back in three days and start learning about what you have told me.
> 
> I am very familiar with Java. I know very little about UIMA. I know decision 
> trees also very well. And I will learn about ctakes more soon.
> What all should I know about UIMA?
> 
> On Sun, Mar 22, 2015 at 9:28 PM, britt fitch
> <britt.fi...@wiredinformatics.com> wrote:
> Sounds good.
> 
> Starting with some references:
> Docs: https://open.med.harvard.edu/wiki/display/SCRUBBER/3.X 
> 
> Publication: http://www.biomedcentral.com/1472-6947/13/112/abstract 
>   (check out the 
> supplemental material as well for additional details on running and 
> improvements)
> SVN (old, standalone, Scrubber v.3.x): 
> https://open.med.harvard.edu/wiki/display/SCRUBBER/Software 
> 
> SVN (initial apache port to ctakes sandbox): 
> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/ 
> 
> 
> The project started off as a standalone process and became a UIMA pipeline 
> (outside of ctakes).
> The plan had always been to port this to an optional ctakes module but we 
> never got that fully implemented.
> 
> Some of the parts that need the most attention to get going:
> - working with the ctakes type system
> - pulling out weka (ML lib) for an asf 2.0 friendly lib instead
> - simpler process for building the models.
> 
> Regarding knowledge, it's good to be familiar with Java, UIMA, decision trees,
> and cTAKES, likely in that order.
> 
> While this is still in the sandbox and you are still getting familiar with
> running it as a standalone app, feel free to ping me and Andy off-list if
> that's more convenient.
> Then we can definitely bring it back to the dev list while getting it running
> in cTAKES.
> 
> Cheers,
> 
> Britt
> 
> 
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com 
> britt.fi...@wiredinformatics.com 
> 
>> On Mar 20, 2015, at 7:57 PM, andy mcmurry wrote:
>> 
>> Britt et al: here is a student named rohit interested in getting the
>> deidentification pipeline running again. Hoping there is still interest in
>> getting this going in ctakes for real. Comments?
>> -- Forwarded message --
>> From: "Rohit Shinde"
>> Date: Mar 20, 2015 5:02 AM
>> Subject: Re: Medical de-identification
>> To: "andy mcmurry" <mcmurry.a...@gmail.com>
>> Cc:
>> 
>> I would certainly be interested in "production grade code". The project
>> also sounds interesting. How do I start working on it? I know Java well.
>> What else would I need to know before starting on this project?
>> 
>> On Fri, Mar 20, 2015 at 12:44 PM, andy mcmurry wrote:
>> 
>>> Yes, the project is in Java, the code was written for a research project
>>> and never made into "production grade code". If you are interested, we
>>> would like to turn the scrubber into a solid pipeline. Java programming
>>> 100%, with Colt statistical library
>>> On Mar 19, 2015 7:52 PM, "Rohit Shinde" wrote:
>>> 
>>>> Hi Andy,
>>>>
>>>> Could you please tell me more about that project? I would really like a
>>>> reply.
>>>>
>>>> Thank you,
>>>> Rohit Shinde
>>>>
>>>> On Wed, Mar 18, 2015 at 5:51 PM, Rohit Shinde <
>>>> rohit.shinde12...@gmail.com > wrote:
>>>>
>>>>> Hi Andy,
>>>>>
>>>>> I am interested in medical de-identification. I would like to know what
>>>>> this project consists of. Is it partially implemented, or does the
>>>>> implementation need to start?
>>>>>
>>>>> What languages would I need to kn

Re: Dependency Parser model data

2015-03-24 Thread Ephi
After some more research, it seems that you can run the CPE to train a
model in version 2.0 of cTAKES but this doesn't work in cTAKES 3.2.

Many files are missing, but more importantly it seems that the model format
has changed. In version 2.0 the model format was the basic liblinear/libsvm
format. In 3.2, the format is the slightly modified ClearNLP format, which
uses strings for features and labels instead of numbers.

So, assuming we want to train on our own data, we would like to be able
to convert a .dep file into a feature file in the ClearNLP format, convert
that to the liblinear format, train with liblinear, and then convert the
resulting model back to the ClearNLP format and pass it to the predictor.
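
The string-to-number step in the middle could look roughly like the sketch below. It assumes binary features and an input of (label, feature list) pairs with string values; it does not reproduce the actual ClearNLP file layout, which I have not confirmed, and the function name is hypothetical. The output lines follow the usual liblinear layout of a label followed by ascending index:value pairs.

```python
def to_liblinear(instances):
    """Map string labels and features to integers and emit liblinear-style lines.

    instances: iterable of (label, [feature, ...]) pairs with string values.
    Returns (lines, label_index, feature_index). Indices start at 1 and
    every feature that is present gets the value 1.
    """
    label_index, feature_index, lines = {}, {}, []
    for label, features in instances:
        y = label_index.setdefault(label, len(label_index) + 1)
        ids = sorted({feature_index.setdefault(f, len(feature_index) + 1)
                      for f in features})
        lines.append(" ".join([str(y)] + ["%d:1" % i for i in ids]))
    return lines, label_index, feature_index
```

The two dictionaries are returned because converting a trained model back to string features and labels needs the inverse of exactly these mappings.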

So is this possible? Is there a way to translate a dependency tree file
into a ClearNLP feature file?

Thanks, Ephi









