better?
That said, I'd love to hear more specific requirements for the Tokenizer to
avoid the odd deliveries above :)
regards
Valery
--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25078755.html
Sent from the Lucene
Simon Willnauer wrote:
>
> you could do
> the whole job in a Tokenizer but this would not be a good separation
> of concerns right!?
>
Right, it wouldn't be a good separation of concerns.
That's why I wanted to know what you consider to be the "Tokenizer's job".
--
thin the token filter
> [...]
>
I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"
Then I will give my 2 cents on this, and perhaps then I can sum up all
opinions and we can reach a common conclusion.
:)
regards
Valery
Hi Simon,
Simon Willnauer wrote:
>
> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
> [...]?!
>
> simon
>
Yes, I did; please see the initial message. Here are the
excerpts:
Valery wrote:
>
> 2) WhitespaceTokenizer gives me
", "R/3", "SAP R/3"} later.
I try to follow the principle that a token (or its lexeme) should usually never be
parsed again. One can build more complex (compound) things from tokens.
However, one usually never chops a lexeme into smaller pieces.
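
To make the "build up, never chop down" idea concrete, here is a minimal
sketch in plain Java. It does not use the Lucene API; the class name
CompoundTokens and the method withCompounds are hypothetical, and the input
list stands in for whatever a tokenizer would emit. It keeps every atomic
token untouched and only adds compounds made of two adjacent tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates composing tokens
// into compounds without ever re-parsing an existing token.
class CompoundTokens {

    // Returns the atomic tokens unchanged, followed by every compound made
    // of two adjacent tokens joined by a single space.
    static List<String> withCompounds(List<String> atomicTokens) {
        List<String> result = new ArrayList<>(atomicTokens);
        for (int i = 0; i + 1 < atomicTokens.size(); i++) {
            result.add(atomicTokens.get(i) + " " + atomicTokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed tokenizer output for the text "SAP R/3".
        List<String> tokens = Arrays.asList("SAP", "R/3");
        System.out.println(withCompounds(tokens)); // prints [SAP, R/3, SAP R/3]
    }
}
```

In this sketch "R/3" is never split at the slash; the only new item is the
compound "SAP R/3", which is assembled from tokens that already exist.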
What do you think, Robert?
maybe even both?..
regards,
Valery
Ken Krugler wrote:
>
> Hi Valery,
>
> From our experience at Krugle, we wound up having to create our own
> tokenizers (actually a kind of specialized parser) for the different
> languages. It didn't seem like a good option to try
Hi Robert,
Thanks for the hint.
Indeed, that is a natural way to go, especially if one builds a Tokenizer of
the same quality as StandardTokenizer.
OTOH, do you mean that the out-of-the-box stuff is indeed not customizable for
this task?
regards
Valery
Robert Muir wrote:
>
d also react to
delimiting characters and emit the token. However, it should distinguish
between delimiters like whitespace and ";,?" on the one hand and delimiters
like "./&" on the other.
Indeed, delimiters like whitespace and ";,?" should be thrown away from
Lexem