Choosing tokenizer based on language of document

Prakashganesh, Prabhu Wed, 04 Apr 2012 05:30:15 -0700

Hi,
I have documents in different languages, and I want to choose the tokenizer 
for each document based on its language. The language is already known and is 
indexed in a field. When I index the document's text, I want to select the 
tokenizer from the value of that language field. I want to use a single field 
for the text (defining a separate field for each language is not an option). 
It seems I can define a tokenizer for a field, so I guess I need to write a 
custom tokenizer that looks at the document's language field and calls the 
appropriate tokenizer for that language (e.g. StandardTokenizer for English, 
CJKTokenizer for CJK languages, etc.). From what I have read, writing a custom 
tokenizer seems quite straightforward, but how would this custom tokenizer know 
the language of the document? Is there some way I can pass this value in to the 
tokenizer? Or is there some way for the tokenizer to access other fields in the 
document? It would be really helpful if someone could provide an answer.
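
To make the idea concrete, here is a rough sketch of the dispatch described 
above. It is plain Java and deliberately does not use the Lucene classes: 
whitespaceTokens and bigramTokens are hypothetical stand-ins for 
StandardTokenizer and CJKTokenizer, and the class name, the language codes and 
the whitespace fallback are assumptions made only for illustration. Note that 
the language is passed in explicitly here, which is exactly the part the 
question is about.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: pick a tokenizing function from a language code.
final class LanguageDispatch {

    // Stand-in for a word tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Stand-in for a CJK tokenizer: overlapping two-character tokens.
    // Works on Java chars, so characters outside the BMP are not handled.
    static List<String> bigramTokens(String text) {
        String s = text.replaceAll("\\s+", "");
        List<String> out = new ArrayList<>();
        if (s.length() == 1) {
            out.add(s);
        }
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // Assumed language codes; the real values depend on the language field.
    static final Map<String, Function<String, List<String>>> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", LanguageDispatch::whitespaceTokens);
        BY_LANGUAGE.put("zh", LanguageDispatch::bigramTokens);
        BY_LANGUAGE.put("ja", LanguageDispatch::bigramTokens);
        BY_LANGUAGE.put("ko", LanguageDispatch::bigramTokens);
    }

    // Unknown languages fall back to whitespace splitting (an assumption).
    static List<String> tokenize(String language, String text) {
        return BY_LANGUAGE
                .getOrDefault(language, LanguageDispatch::whitespaceTokens)
                .apply(text);
    }
}
```

With this, LanguageDispatch.tokenize("en", "hello world") splits on whitespace, 
while LanguageDispatch.tokenize("zh", ...) produces overlapping character 
pairs. The open question is how a tokenizer configured on a field would get 
hold of the language argument at index time.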


Thanks
Prabhu
