Another question/topic likely for Sean & Tim. Happy to get others’ feedback as 
well.

I am trying to identify gene-related information.

It appears that the PTB tokenization logic, both in the tokenizer and in 
dictionary building, splits a string into multiple tokens if it contains a 
period and is not a number.

For example, given “22q11.2 deletion syndrome”:

PTB tokenizer: [22q11, .2, deletion, syndrome]
POS for the above term: [CD, CD, NN, NN]
Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string is split differently, as [22q11, ., 2, deletion, syndrome], 
in the new dictionary module (RareWordTermMapCreator.getTokens).
When the _rareWordTermMap is created, it uses the first token as the key: 
22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
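
To make the difference concrete, here is a minimal Java sketch of the two 
splitting behaviors described above. It is not the cTAKES code: the class and 
method names (PeriodSplitSketch, ptbLikeSplit, dictionaryLikeSplit) and the 
number test are assumptions made up for illustration, and it only models the 
period handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the cTAKES tokenizer or RareWordTermMapCreator.
final class PeriodSplitSketch {
    // Assumed definition of "a number": optional sign, digits, optional decimal part.
    private static final Pattern NUMBER = Pattern.compile("[+-]?(\\d+\\.?\\d*|\\.\\d+)");

    // PTB-like behavior as observed: a non-number containing a period is split
    // before the period, and the period stays attached to what follows it.
    static List<String> ptbLikeSplit(String word) {
        int dot = word.indexOf('.');
        if (dot <= 0 || NUMBER.matcher(word).matches()) {
            return Collections.singletonList(word);
        }
        return Arrays.asList(word.substring(0, dot), word.substring(dot));
    }

    // Dictionary-like behavior as observed: every period becomes its own token.
    static List<String> dictionaryLikeSplit(String word) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c == '.') {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(".");
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits on whitespace, then applies the given per-word splitter.
    static List<String> tokenize(String text, Function<String, List<String>> splitter) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            out.addAll(splitter.apply(word));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "22q11.2 deletion syndrome";
        // prints [22q11, .2, deletion, syndrome]
        System.out.println(tokenize(text, PeriodSplitSketch::ptbLikeSplit));
        // prints [22q11, ., 2, deletion, syndrome]
        System.out.println(tokenize(text, PeriodSplitSketch::dictionaryLikeSplit));
    }
}
```

The point of the sketch is only that the two paths disagree on the second 
token, so a dictionary entry built from one split cannot line up token for 
token with text tokenized by the other.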

The difference in period splitting above (period alone vs. period + number) 
might be irrelevant here, because for the input “22q11.2 deletion syndrome” 
the lookup indices are [2,3].
The new lookup ignores the incoming token “22q11” because it is tagged CD, and 
“.2” because it is a number.
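
The filtering I am describing can be sketched roughly as follows. This is an 
assumption-laden illustration, not the actual lookup code: LookupFilterSketch, 
lookupIndices, the excluded-POS parameter, and the numeric test are all 
hypothetical names and choices of mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the cTAKES lookup implementation.
final class LookupFilterSketch {
    // Returns the indices of tokens that would be used as lookup tokens:
    // a token is skipped if its POS tag is excluded or if it looks like a number.
    static List<Integer> lookupIndices(List<String> tokens, List<String> posTags,
                                       Set<String> excludedPos) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            boolean excludedByPos = excludedPos.contains(posTags.get(i));
            // Assumed numeric test: optional sign, digits, optional decimal part.
            boolean numeric = tokens.get(i).matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
            if (!excludedByPos && !numeric) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("22q11", ".2", "deletion", "syndrome");
        List<String> pos = Arrays.asList("CD", "CD", "NN", "NN");
        // With CD excluded, only "deletion" and "syndrome" remain: prints [2, 3]
        System.out.println(lookupIndices(tokens, pos, Collections.singleton("CD")));
        // With no POS excluded, "22q11" is kept but ".2" is still numeric: prints [0, 2, 3]
        System.out.println(lookupIndices(tokens, pos, Collections.<String>emptySet()));
    }
}
```

Under these assumptions, the token that keys the dictionary entry (22q11) is 
never offered to the lookup while CD is excluded.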

It looks like this concept cannot be identified unless CD is allowed as a 
lookup token POS.
Even if CD is allowed, though, I think the PTB rules might not be the best fit 
for gene locations.

Does anyone have thoughts on, or experience with, handling gene location 
mentions like this?
Should the Fast Dict tokenization logic match the PTB tokenizer logic to 
produce the same components?

Let me know if I have misread any of these points. Since these changes would 
likely be large, I would like to get some feedback before moving forward.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com
