Hello cTAKES devs,
Before I start submitting pull requests, I wanted to introduce myself to
the team. My name is Jeffery Painter and I am currently working at
GlaxoSmithKline as a researcher in our AI/ML group for drug safety. I
have worked with the UMLS pretty extensively since around 2006-2007 and
was one of the co-developers of what became the OMOP Common Data Model
(CDM). You can see my full list of publications here:
https://javastats.com/about.html
I have been a long-time contributor to the Apache Turbine and Apache
Torque projects, but I have not participated directly in other Apache
projects, so I am not sure what the etiquette is for this group.
I had an itch to scratch: I had been using the old Perl
UMLS::Similarity module for years, and about a year ago I discovered
that the ctakes-ytex package could potentially serve the same purpose.
I have been able to wrangle it into producing what I need to support my
work. However, as you probably all know, this code has not really been
touched since 2013, from what I can tell.
I have been able to update the ctakes-ytex build process to run with
modern versions of MySQL (tested on Ubuntu 23.10 with MySQL 8.0.35),
and I would like to contribute that back to the project. I have also
found some computational "bugs" in a couple of the kernel metrics in
the ctakes-ytex package. I have corrected them so that their outputs
now match those generated by the old Perl UMLS::Similarity package.
Finally, I've added a couple of metrics provided by the Perl module
which were not in the ctakes-ytex code (such as the Resnik and Faith
algorithms).
I was going to propose submitting as 3 separate pull requests:
PR-1 : update build process to support modern MySQL database
Q's - is it appropriate to update the supplied MRCONSO.RRF and MRSTY.RRF
files from UMLS with current versions? How about me updating the
pre-built concept graphs as binary .gz files?
To support the MySQL connection, there are XML templates which parse the
DB connection, and I don't have an elegant way to handle the & vs
&amp; difference, so I created two separate DB properties (one for
direct JDBC connections in the ctakes-ytex code, which can't work with
the &amp; escaped version, and another for the XML templates to parse).
PR-2 : submit corrections to the metrics which are broken
PR-3 : submit new metrics
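On the & vs &amp; question under PR-1, one possible alternative to
maintaining two properties is to keep only the raw JDBC URL and derive
the XML-safe form from it when the templates are filled in. The
following is only a minimal sketch of that idea, not code from
ctakes-ytex: the class name JdbcUrlXml, the method escapeForXml, and
the example URL (host, port, database name) are all hypothetical.

```java
public final class JdbcUrlXml {
    private JdbcUrlXml() {}

    /**
     * Escapes the five XML-reserved characters so a raw value can be
     * placed in an XML attribute or element. The input must be the raw
     * (unescaped) value; already-escaped input would be escaped twice.
     */
    static String escapeForXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical raw URL, as it would be handed to the JDBC driver.
        String jdbcUrl =
            "jdbc:mysql://localhost:3306/umls?useSSL=false&serverTimezone=UTC";
        // Derived form for substitution into the XML templates.
        System.out.println(escapeForXml(jdbcUrl));
    }
}
```

With something like this, the properties file would hold a single raw
URL, and only the template step would see the escaped version.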
Please let me know if this makes sense, and how best to work with your
team. Should I fork the GitHub repo into my own account, or create a new
branch and submit the PRs from there? What are the preferred ways of
working and contributing to the cTAKES project?
As I said, I have Apache credentials already from my work with the
Turbine team since 2003, and I was elected a member a couple of years
ago :-)
Best,
Jeffery Painter
j...@jivecast.com
pain...@apache.org