Hi,
I have documents in different languages and I want to choose the
tokenizer to use for a document based on the language of the document. The
language of each document is already known and is indexed in a field. At index
time, I want to pick the tokenizer for the document text based on the value of
that language field. I want to keep a single field for the text (defining a
separate field for each language is not an option). It seems I can define a
tokenizer per field, so I
guess what I need to do is to write a custom tokenizer that looks at the
language field value of the document and calls the appropriate tokenizer for
that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK
languages, etc.). From what I have read, writing a custom tokenizer seems quite
straightforward, but how would this custom tokenizer know the language
of the document? Is there some way I can pass this value in to the tokenizer?
Or is there some way for the tokenizer to access other fields in the
document? It would be really helpful if someone could provide an answer.
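
In case it helps, here is a rough sketch in plain Java of the dispatch I have
in mind. None of these names are Lucene classes: LanguageDispatch,
whitespaceTokens and bigramTokens are made-up stand-ins for StandardTokenizer
and CJKTokenizer, and the language codes are only examples. The part I cannot
work out is where the "language" argument would come from inside a real
tokenizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; nothing here is part of the Lucene API.
class LanguageDispatch {

    // Stand-in for StandardTokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for CJKTokenizer: overlapping two-character tokens.
    // Works on UTF-16 chars, so it ignores supplementary code points.
    static List<String> bigramTokens(String text) {
        List<String> tokens = new ArrayList<>();
        String s = text.replaceAll("\\s+", "");
        if (s.length() == 1) {
            tokens.add(s);
        }
        for (int i = 0; i + 1 < s.length(); i++) {
            tokens.add(s.substring(i, i + 2));
        }
        return tokens;
    }

    // Language code (the value of the language field) -> tokenizer to use.
    static final Map<String, Function<String, List<String>>> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", LanguageDispatch::whitespaceTokens);
        BY_LANGUAGE.put("zh", LanguageDispatch::bigramTokens);
        BY_LANGUAGE.put("ja", LanguageDispatch::bigramTokens);
        BY_LANGUAGE.put("ko", LanguageDispatch::bigramTokens);
    }

    // Pick the tokenizer from the language value; fall back to whitespace
    // splitting when the language is not in the map.
    static List<String> tokenize(String language, String text) {
        return BY_LANGUAGE
                .getOrDefault(language, LanguageDispatch::whitespaceTokens)
                .apply(text);
    }
}
```

So a call like LanguageDispatch.tokenize("en", text) is what I would like to
happen per document, with the first argument taken from the language field.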
Thanks
Prabhu