Absolutely. I’ll create it now. Thanks!

Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 9, 2015, at 3:12 PM, Finan, Sean
> <sean.fi...@childrens.harvard.edu> wrote:
>
> Hi Britt,
>
> I’ve got some code and tests to check in. Would you like to write the
> jira item?
>
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Thursday, July 09, 2015 8:55 AM
> To: dev@ctakes.apache.org
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
>
> I don’t think that is too much of a constraint, at least initially, to
> have all CUI values a consistent length for a given prefix.
>
> Thanks Sean, let me know if there is any part of this you’d like a hand
> with.
>
> Cheers,
>
> Britt
>
> On Jul 8, 2015, at 7:16 PM, Finan, Sean
> <sean.fi...@childrens.harvard.edu> wrote:
>
> Hi Britt,
>
> You’ve got it exactly.
>
> I actually started working on this right before a meeting right before
> I left work right before I went to the store … but I’m now back to it
> and I’m going to move forward with the tiny bot that I’ve got. I don’t
> think that it will take too long …
>
> One reason that I like the “pair” idea is that something like
> “CN123456” won’t get converted to “CN0123456” by assuming that it is a
> seven digit numerical base. Likewise somebody could make a tiny
> dictionary with “SEAN01, SEAN02, SEAN03…” through 99. Then their output
> would still be formatted as “SEAN01 .. SEAN99”. They couldn’t mix in
> “SEAN1, SEAN2 …” though. Is that too much of a restraint? Hmmm. Well,
> I’m going to push forward with this idea.
>
> I’ll check in whatever I get done tonight.
>
> Cheers,
> Sean
>
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 4:21 PM
> To: dev@ctakes.apache.org
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
>
> Thanks for the details Sean. I had assumed the conversion to Long was
> related to sort/search efficiency but that makes sense.
>
> I had been thinking of something similar with parsing out the
> non-numerals and converting them to 2-digit numeral values, i.e.
> “a” = 01, “z” = 26. Ultimately CN123456 would become 0314123456 but I
> don’t think it’s sophisticated enough to avoid issues with leading
> zeros. We could prepend a 9 to it to avoid losing digits and use
> something like:
>
> if (length > 8 && begins with 9)
>     discard 9
> while (length > 8)
>     convert first 2 numbers to a letter
>
> I think your suggestion sounds good to me. To run the example through:
>
> “NLM300” gets parsed to “NLM” + “300”
> Store Pair<Integer,String>(3, NLM) at Pair[0]
> Produce a Long of 0 x 10000000 + 300 = 300L
> Backtrack to the actual “CUI”: floor(300/10000000) = 0L
> 300L - 0L = 300L
> Pair[0] = NLM
> CUI = NLM + 300
>
> In that case, do we need to store it as a Pair at all or is just
> storing the prefix in a String[] sufficient?
>
> I’m happy to start working on this unless you have a preference for
> splitting it out into multiple tasks.
>
> On Jul 8, 2015, at 2:54 PM, Finan, Sean
> <sean.fi...@childrens.harvard.edu> wrote:
>
> By the way, in case you are wondering why it does this … the umls
> database that we use has roughly half a million cuis. Storing cuis in
> the various tables as longs takes up a lot less space than storing them
> as 8 character strings.
>
> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: dev@ctakes.apache.org
> Subject: dictionary-look-fast fails to handle alternative CUIs
>
> This is largely directed to Sean but open to other feedback as well.
>
> The current fast lookup using a BSV parses the first field as “C” and
> up to 7 numerals, padding with “0” as needed to reach that length when
> applicable [see CuiCodeUtil.getCuiCode(String)]
>
> The CUI string is then substring’d from 1 to len and parsed as a Long.
>
> This is producing issues with other related, but separate, ontologies
> (MedGen) where the bulk of concepts use UMLS CUIs but some additional
> concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6
> numerals, resulting in “N123456” failing to produce a Long.
>
> I wanted Sean’s thoughts on this and to get some feedback on if others
> are running into this issue and if the community wants a solution to
> providing a CUI format beyond the standard C + 7 numerals.
>
> I’m happy to make these edits and check them in whether that means
> updating the CuiCodeUtil class or creating an entirely new
> BSVConceptFactory if that’s what makes the most sense.
>
> Thoughts?
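
For anyone reading this thread later, here is a minimal sketch of the prefix-plus-number idea discussed above. It is not the code that was checked in and not the real CuiCodeUtil. The class name PrefixedCuiCodec, the method names, and the SLOT_SIZE constant are all hypothetical, and it assumes that the Integer in the Pair is the digit count and that each prefix always uses the same number of digits (the constraint Sean and Britt agreed on).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the actual cTAKES CuiCodeUtil. */
class PrefixedCuiCodec {
    // Assumption: each prefix gets a block of 10,000,000 codes (room for 7 digits).
    private static final long SLOT_SIZE = 10_000_000L;

    // Parallel lists: slot index -> prefix and slot index -> digit count.
    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> digitCounts = new ArrayList<>();

    /** Splits "CN123456" into "CN" + "123456" and folds the prefix slot into a long. */
    public long encode(String cui) {
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        String prefix = cui.substring(0, split);
        String digits = cui.substring(split);
        if (digits.isEmpty() || digits.length() > 7) {
            throw new IllegalArgumentException("Expected 1 to 7 digits in " + cui);
        }
        // Throws NumberFormatException if letters follow the digits.
        long number = Long.parseLong(digits);
        return slotFor(prefix, digits.length()) * SLOT_SIZE + number;
    }

    /** Recovers the original text, restoring leading zeros from the stored digit count. */
    public String decode(long code) {
        int slot = (int) (code / SLOT_SIZE);
        long number = code % SLOT_SIZE;
        return prefixes.get(slot) + String.format("%0" + digitCounts.get(slot) + "d", number);
    }

    // Registers a new prefix on first sight; rejects a known prefix with a new digit count.
    private int slotFor(String prefix, int digitCount) {
        int slot = prefixes.indexOf(prefix);
        if (slot < 0) {
            prefixes.add(prefix);
            digitCounts.add(digitCount);
            return prefixes.size() - 1;
        }
        if (digitCounts.get(slot) != digitCount) {
            throw new IllegalArgumentException("Prefix " + prefix + " already uses "
                    + digitCounts.get(slot) + " digits");
        }
        return slot;
    }

    public static void main(String[] args) {
        PrefixedCuiCodec codec = new PrefixedCuiCodec();
        for (String cui : new String[] {"C0123456", "CN123456", "NLM300"}) {
            long code = codec.encode(cui);
            System.out.println(cui + " -> " + code + " -> " + codec.decode(code));
        }
    }
}
```

In this sketch the slot index depends on the order in which prefixes are first seen, so the codes are only meaningful alongside the prefix table that produced them. Whether the digit count needs to be stored at all, or a plain String[] of prefixes is enough, is the open question Britt raises above; keeping the count is what lets “C0123456” come back with its leading zero.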