Re: Proposed improvements [EXTERNAL]

Hadrian Zbarcea Wed, 28 Jun 2017 08:20:02 -0700

Ok, so:

* I tracked down the source code for the 2016 version [1] of lvg we use. The source is included in the (almost) 1G .tgz; I didn't check the lite version.
* There is a newer 2017 version [2]. I don't know if the community wants to upgrade.
* I looked at the code, and I will refrain from any comment now, but one thing is clear: no way this code will work in OSGi.

Sean, I vastly underestimated the work required to achieve what I hoped. I will not back off, but there's a lot of work to be done and I am not even sure where to start yet. To your comment re: OSGi, the issue is that there are too many constraints embedded in the code: dependency on the file system, an embedded database, etc. In my opinion the biggest bang for the buck would come from cleaning up the architecture and dependency structure, making it more loosely coupled. I'd be happy to work with (and learn from) you on it.


Cheers,
Hadrian

[1] https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2016/web/download.html
[2] https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/web/release/index.html



On 06/27/2017 07:36 PM, Hadrian Zbarcea wrote:

Speaking of lvg. Does anybody know where the source code for lvgdist-2016.0.jar is?
Thanks,
Hadrian



On 06/26/2017 11:04 AM, Finan, Sean wrote:
Hi Andrey,

Thank you for the input.  Thank you also Hadrian.
With regard to a smaller ctakes, I know that a couple of people (including yours truly) are currently working on trimming some fat. A few areas have been targeted, with the old/huge umls dictionary being at the top of the list. It is deprecated and only used in a few tests. Lvg is also used in a few test configurations, but I am unsure of its necessity.
As far as a "ctakes core" ... I have been trying to figure out a smart way to separate the default clinical pipeline modules from others, making the others optional. I already have a pom for clinical that does not include relation, temporal, coref, very importantly ytex ... as those are not part of the default clinical pipeline. One thing that has me halted is figuring out how and where to make a simple mechanism for people to grab the more advanced modules. A while ago I put a project pom in sandbox under "ctakes the api" or something to that effect. It is basically a pom with advanced modules commented out. A developer could start with that pom as their project main, then uncomment modules as needed. It was a first ten-minute attempt at something simple and, while worth a try, not an ideal solution.
Another idea that I have been tossing around is separating tests into separate modules. Also possibly "training" into separate modules. It is standard practice to keep parallel src/ and test/ directories in a repository and this kind of follows that thinking. Many of the tests (such as mentioned above) require/use modules and resources that are not actually required to build the source. The same goes for possible examples. I think that the same could be true for training - if not now, perhaps in the future. Again, I am held up on the best way to actually do this, keeping things simple wrt maven and a lack of excess complexity. The last thing that I want to do is make ctakes more difficult to use.
Maybe osgi can help the above, but I'm honestly not sure how. If anybody else thinks that it can then I am going to let them handle it. Perhaps I am just jaded. Years ago my previous company had great hopes for osgi and invested a lot of time (=money) into applying it to our applications. Over a million dollars later, the consensus was that osgi couldn't apply to our applications without completely rewriting infrastructure - which was an absolute no-go - and, even if it could just be slapped on overnight, would do nothing for us or our customers.
With regard to better logging, I think that James added some more detailed logging for the 4.0 release, and I think that he has a few more areas slated. There are more logging statements that exist at finer levels than "info" and can be seen by changing the log4j configuration. As for changing the entire codebase to slf4j, I may be alone but I'm not sure how that alone will make ctakes any more transparent.
With regard to ctakes setup having some quirks ... yup. Known issue to a lot of us. Documentation was improved for the 4.0 release, but "run anywhere" documentation is difficult to both create and maintain. Several ideas have been tossed around including installation scripts, an "environment/setup investigation/confirmation" gui or something like a running faq/blog on nothing but installation problems and solutions.
Sean

-----Original Message-----
From: Andrey Kurdumov [mailto:[email protected]]
Sent: Sunday, June 25, 2017 1:52 AM
To: cTakes developers list
Subject: Re: Proposed improvements [EXTERNAL]
Just want to note that the ASF PMC wants to make GitHub the primary repository and the Apache servers secondary soon.
Regarding improvements:
I personally want better support for embedding. Right now the cTakes distribution comes with LVG and the UMLS dictionary, and the size of cTakes thus becomes very large. I would like to have (and work on) a much leaner distribution, let's name it cTakes Core, which will just provide the cTakes executable without the need for data. Right now I constantly have to rip out that data after the cTakes build, which slows down my build significantly.
Personally I support Hadrian's initiative to have better logging, since cTakes setup has some quirks which could be resolved faster with better logging.
2017-06-23 17:38 GMT+06:00 Miller, Timothy <
[email protected]>:
Thanks Hadrian, I hadn't heard of OSEHRA but it looks interesting and
like something where we should be making people aware of cTAKES!

svn vs. git -- I'm with you on preferring git, but not by so much that
it's worth spending time on an argument if it turns into an argument
:). As far as I know we've never really had a discussion about it.
It's probably getting to the point where new developers have _only_
used git and would find it a complete roadblock to use svn but for me
it's just a mild annoyance.

All others you mentioned -- if you are willing to contribute a patch
we are happy to accept one-off contributions, and we are also
interested in growing the developer community with people who are
interested in contributing regularly over time.

Tim

________________________________________
From: Hadrian Zbarcea <[email protected]>
Sent: Thursday, June 22, 2017 9:14 PM
To: [email protected]
Subject: Proposed improvements [EXTERNAL]

Last week I presented at the OSEHRA Summit about ActiveMQ (and a few
other projects) and the ASF in general.

I was surprised that most didn't know much about the ASF and more
importantly that nobody knew about cTakes, the only (directly)
healthcare related project at the ASF. There was no cTakes talk at
ApacheCon in Miami, but at OSEHRA, which is all about healthcare we
should have had a presence. I will probably submit a talk for next
year, but until then, because I think I created a bit of interest in
cTakes I went to build cTakes myself and try a few things.

Some of my findings are:
* test failures with openjdk; granted the docs mention oracle jdk as a
prerequisite, but I think it's easy to support openjdk
* use of svn vs git; this is a debatable topic, but by now everybody
and their uncles are on git so moving to git (which I'd recommend)
would probably foster adoption (yes, I know about the github mirror)
* no support for OSGi, many large players use it
* improvements in logging could go a long way, starting with moving to
slf4j

Suggesting improvements implies that I volunteer to do a good chunk of
the work, but before that I'm more interested in how much the
community would welcome such improvements. I am curious which ones are
considered low-hanging fruit; the more controversial topics we could
take to [discuss] threads. Because every community has its own culture
and I am not that familiar with the cTakes one, although I went
through the mail archives, I thought a prudent first step would be to
start with this.
Feedback appreciated,
Hadrian
