Hi Arslan,
UAX29URLEmailTokenizerImpl.jflex includes ASCIITLD.jflex-macro, which has this
at the end:
> ) "."? // Accept trailing root (empty) domain
So trailing dots are recognized as part of domains that are included in URLs
and email addresses. But maybe they shouldn’t be? (Except maybe
About what you see with ICU: it is correct. You have to make sure you
handle "Common":
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java
It mostly behaves like
http://icu-project.org/apiref/icu4j/com/ibm/
Hi,
I extracted Emails and URLs from certain TREC collections using
TestUAX29URLEmailTokenizer combined with TypeTokenFilter.
High-frequency terms reveal that:
* some e-mail addresses start with apostrophes
* some e-mails or URLs end with a period.
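Until the tokenizer behaviour is settled, such terms could be cleaned up after extraction. The following is only a minimal sketch in plain Java, assuming the extracted terms are available as strings. The class and method names (ExtractedTermCleaner, clean) are hypothetical and not part of Lucene. Note that it also drops a trailing dot that was meant as the root domain.

```java
// Hypothetical helper, not a Lucene API: post-processes terms that were
// already extracted as URLs or e-mail addresses.
final class ExtractedTermCleaner {
    private ExtractedTermCleaner() {}

    /** Strips leading apostrophes and trailing periods from one term. */
    static String clean(String term) {
        int start = 0;
        int end = term.length();
        // Skip leading ASCII (') and typographic (U+2019) apostrophes.
        while (start < end
                && (term.charAt(start) == '\'' || term.charAt(start) == '\u2019')) {
            start++;
        }
        // Drop trailing periods, including a trailing root-domain dot.
        while (end > start && term.charAt(end - 1) == '.') {
            end--;
        }
        return term.substring(start, end);
    }
}
```

For example, ExtractedTermCleaner.clean("'user@example.com") yields "user@example.com", and clean("http://example.com.") yields "http://example.com".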
I ran a few tests and this behaviour occurs only i