Hi Sean, do you want a ticket for the PTB update? Cheers,
Britt

Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com

> On Jul 15, 2015, at 9:07 AM, britt fitch <britt.fi...@wiredinformatics.com>
> wrote:
>
> Thanks Sean.
>
> The other part of the concern is whether it's reasonable/feasible to alter
> tokenization rules for things like gene locations. I can work around this in
> a few ways, but if there are other examples of how this might come up in
> other cases, it could be worth looking at a blanket change. Sadly I don’t
> have another example off the top of my head, maybe organism names? Doing a
> few queries for terms in the UMLS with periods, the majority of them seem to
> be things you really would want to split on. Perhaps genes are just an edge
> case.
>
> I was looking at gene locations overall, not any particular gene or disorder
> grouping. The term I mentioned was just meant to be an example.
>
>> On Jul 15, 2015, at 8:57 AM, Finan, Sean <sean.fi...@childrens.harvard.edu>
>> wrote:
>>
>> Hi Britt,
>>
>> The dictionary should be using PTB tokenization, but I obviously missed a
>> rule and separated the . from the following 2 in the dictionary.
>>
>> I will double-check everything.
>>
>> Sean
>>
>> p.s. if you don’t mind my asking, are you looking into all connective tissue
>> disorders or just Shprintzen?
>>
>> From: britt fitch [mailto:britt.fi...@wiredinformatics.com]
>> Sent: Tuesday, July 14, 2015 3:58 PM
>> To: dev@ctakes.apache.org
>> Subject: periods and the interaction with PTB & Fast Dict Lookup.
>>
>> Another question/topic likely for Sean & Tim. Happy to get others’ feedback
>> as well.
>>
>> I am trying to identify gene-related information.
>>
>> It appears that the PTB tokenization logic in places like the tokenizer &
>> dictionary building will split a string into multiple tokens if it is not a
>> number and contains a period.
>>
>> For example, given “22q11.2 deletion syndrome”:
>>
>> PTB tokenizer: [22q11, .2, deletion, syndrome]
>> POS for the above term: [CD, CD, NN, NN]
>> Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]
>>
>> The same string creates a different split of [22q11, ., 2, deletion,
>> syndrome] in the new dictionary module (RareWordTermMapCreator.getTokens).
>> When the _rareWordTermMap gets created it uses the first token as the key:
>> 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
>>
>> The period-split difference above (period alone vs period + number) might be
>> irrelevant here because for the input “22q11.2 deletion syndrome”, the
>> lookup indices are [2,3].
>> The new lookup will ignore incoming tokens “22q11” because it’s CD and “.2”
>> because it’s a number.
>>
>> It looks like this concept cannot be identified unless CD is allowed as a
>> lookup token POS.
>> Even if this is allowed, though, in the case of gene locations I think the
>> PTB rules might not be the best fit.
>>
>> Are there any thoughts/experiences regarding addressing gene location
>> mentions like this?
>> Should the Fast Dict tokenization logic match the PTB tokenizer logic to
>> produce the same components?
>>
>> Let me know if I misread any of these points. Since these items would
>> likely cause large changes, I am looking to get some feedback before
>> moving forward.
>>
>> Cheers,
>>
>> Britt
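To make the mismatch described in the quoted message concrete, here is a minimal Python sketch. It is not the cTAKES code: `ptb_like_tokens`, `dictionary_like_tokens` and `lookup_indices` are hypothetical stand-ins that only reproduce the splits and the CD exclusion reported in the thread, and the POS tags are copied from the message rather than computed.

```python
import re


def ptb_like_tokens(text):
    """Mimic the split reported for the PTB tokenizer: '22q11.2' -> '22q11', '.2'."""
    tokens = []
    for word in text.split():
        # Plain numbers such as '3.14' are left whole (assumption taken from
        # the thread: only non-numbers containing a period are split).
        if re.fullmatch(r"\d+(\.\d+)?", word):
            tokens.append(word)
            continue
        # Otherwise split before a trailing '.digits' group, if there is one.
        match = re.fullmatch(r"(.+?)(\.\d+)", word)
        tokens.extend(match.groups() if match else [word])
    return tokens


def dictionary_like_tokens(text):
    """Mimic the split reported for the dictionary builder: the period stands alone."""
    return re.findall(r"[^\s.]+|\.", text)


def lookup_indices(pos_tags, excluded=("CD",)):
    """Indices of tokens that may start a lookup; CD is excluded, as in the thread."""
    return [i for i, tag in enumerate(pos_tags) if tag not in excluded]


text = "22q11.2 deletion syndrome"
doc_tokens = ptb_like_tokens(text)          # tokens seen in the document
dict_tokens = dictionary_like_tokens(text)  # tokens seen when building the dictionary
pos_tags = ["CD", "CD", "NN", "NN"]         # tags quoted in the thread, not computed

# Tokens the lookup would actually probe, and the key the term is stored under
# (the first dictionary token, per the thread).
probe_tokens = [doc_tokens[i] for i in lookup_indices(pos_tags)]
term_key = dict_tokens[0]

print(doc_tokens)
print(dict_tokens)
print(probe_tokens, term_key in probe_tokens)
```

Under these assumptions the term is keyed on "22q11", but the only tokens probed are "deletion" and "syndrome", so the entry is never reached regardless of how the period is split. Passing `excluded=()` to `lookup_indices` shows the effect of allowing CD as a lookup POS.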