Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

2017-08-14 Thread Steve Rowe
Hi Arslan,

UAX29URLEmailTokenizerImpl.jflex includes ASCIITLD.jflex-macro, which has this at the end:

> ) "."? // Accept trailing root (empty) domain

So trailing dots are recognized as part of domains that are included in URLs and email addresses. But maybe they shouldn’t be? (Except maybe
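To illustrate the effect of an optional trailing "." in a domain rule, here is a minimal Python sketch. It is not the real ASCIITLD.jflex-macro grammar: the LABEL and TLD patterns, the tiny TLD list, and the helper first_domain are all hypothetical simplifications chosen only to show how a sentence-ending period can end up inside the matched domain.

```python
import re

# Simplified stand-ins for the grammar pieces (assumptions, not Lucene's rules).
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
TLD = r"(?:com|org|net|edu|gov)"  # tiny hypothetical TLD list

# Variant 1: accepts an optional trailing root (empty) domain, i.e. a final ".".
DOMAIN_WITH_ROOT = re.compile(rf"(?:{LABEL}\.)+{TLD}\.?")
# Variant 2: same rule without the optional trailing ".".
DOMAIN_NO_ROOT = re.compile(rf"(?:{LABEL}\.)+{TLD}")


def first_domain(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sentence = "Please visit www.example.com. Thanks"
    # With the optional root, the sentence-ending period is absorbed.
    print(first_domain(DOMAIN_WITH_ROOT, sentence))
    # Without it, the match stops at the TLD.
    print(first_domain(DOMAIN_NO_ROOT, sentence))
```

In this sketch the first pattern returns the domain including the final period, while the second stops at "com". Whether dropping the optional dot is the right change for the real tokenizer is a separate question that the sketch does not settle.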

Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

2017-08-12 Thread Robert Muir
About what you see with ICU: it is correct; you have to make sure you handle "Common":

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java

It mostly behaves like http://icu-project.org/apiref/icu4j/com/ibm/

TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

2017-08-12 Thread Ahmet Arslan
Hi,

I extracted emails and URLs from certain TREC collections using TestUAX29URLEmailTokenizer combined with TypeTokenFilter. The high-frequency terms reveal that

 * some e-mail addresses start with apostrophes
 * some e-mails or URLs end with a period.

I ran a few tests and this behaviour occurs only i