Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-02 Thread Steve Rowe
Paul, You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds better handling for some languages to UAX#29 Word Break rules conformance, and also finds token boundaries when the writing system (aka script) changes. This is intended to be extensible per script. The root br...
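A minimal sketch of driving ICUTokenizer directly, assuming the 4.x-style API where tokenizers take a Reader (exact ICUTokenizer constructors vary between 4.x releases, so treat this as illustrative rather than definitive):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Requires the lucene-analyzers-icu module (and its ICU4J dependency).
public class IcuTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // Mixed Latin / Cyrillic / Hiragana input: script changes become token boundaries.
    Tokenizer tok = new ICUTokenizer(new StringReader("Hello Привет こんにちは"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tok.addAttribute(TypeAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term + "\t" + type.type());
    }
    tok.end();
    tok.close();
  }
}
```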

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 18:42, Steve Rowe wrote: Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. Yeah sure, I did try this and hit a...

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Steve Rowe
Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, ...
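Worth noting what that limitation looks like in practice: with no UAX#29 word boundaries defined inside Han text, StandardTokenizer falls back to one token per ideograph. A sketch, assuming the 4.1-era Version-and-Reader constructor:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CjkUnigramSketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_41, new StringReader("我爱搜索"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term);  // 我 / 爱 / 搜 / 索 -- one token per character
    }
    tok.end();
    tok.close();
  }
}
```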

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Michael McCandless
I played with this possibility on the extremely experimental https://issues.apache.org/jira/browse/LUCENE-5012 which I haven't gotten back to for a long time... The changes on that branch add the idea of a "deleted token", by just setting a new DeletedAttribute marking whether the token is deleted...
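The LUCENE-5012 branch itself may define this differently, but the custom-attribute idea looks roughly like the sketch below: an Attribute interface plus the conventionally named *Impl class that Lucene's default AttributeFactory instantiates automatically, so a filter can mark a token as deleted instead of physically removing it from the stream. The names and shape of DeletedAttribute here are an assumption, not the branch's actual API:

```java
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// Hypothetical sketch of a "deleted token" attribute; the real
// LUCENE-5012 branch may use different names and semantics.
interface DeletedAttribute extends Attribute {
  void setDeleted(boolean deleted);
  boolean isDeleted();
}

// Located by Lucene's default AttributeFactory via the "<interface>Impl" naming convention.
class DeletedAttributeImpl extends AttributeImpl implements DeletedAttribute {
  private boolean deleted;

  @Override public void setDeleted(boolean deleted) { this.deleted = deleted; }
  @Override public boolean isDeleted() { return deleted; }
  @Override public void clear() { deleted = false; }
  @Override public void copyTo(AttributeImpl target) {
    ((DeletedAttribute) target).setDeleted(deleted);
  }
}
```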

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 08:08, Dawid Weiss wrote: Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's useful to have for other NLP tasks). Do you think it'd be possible (read: relatively easy) to create an analyzer (or a modification...

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's useful to have for other NLP tasks). Do you think it'd be possible (read: relatively easy) to create an analyzer (or a modification...
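Outside of a Lucene tokenizer chain, plain Java's BreakIterator already does UAX#29-style word segmentation and, unlike StandardTokenizer, hands back every segment, punctuation included. Not the analyzer Dawid is asking about, but a sketch of the underlying behavior:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class KeepPunctuationSketch {
  public static void main(String[] args) {
    String text = "Hello, world! It's 3.14.";
    BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
    bi.setText(text);
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      String segment = text.substring(start, end);
      if (!segment.trim().isEmpty()) {  // skip whitespace-only segments, keep punctuation
        System.out.println("[" + segment + "]");  // [Hello] [,] [world] [!] [It's] [3.14] [.]
      }
    }
  }
}
```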

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Steve Rowe
Hi Paul, StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0. Only those...
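That words-only behavior is easy to see directly; a minimal sketch, assuming the Lucene 4.1 API (Version-and-Reader constructor):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardTokenizerSketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_41,
        new StringReader("Hello, world! (testing)"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term);  // Hello / world / testing -- punctuation never surfaces
    }
    tok.end();
    tok.close();
  }
}
```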

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
Yes, most special characters are treated as term delimiters, except that underscores, dots, and commas have some special rules. See the details under Standard Tokenizer in my Solr e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-2120354 ...
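Those special rules come straight from UAX#29's MidNum, MidNumLet, and ExtendNumLet word-break properties. A sketch of the expected behavior (4.1-style API again; worth verifying against your exact version):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DelimiterRulesSketch {
  // Expected: "1,000" -> [1,000]   comma between digits does not split (MidNum)
  //           "3.14"  -> [3.14]    dot between digits does not split (MidNumLet)
  //           "foo_bar" -> [foo_bar]  underscore joins runs (ExtendNumLet)
  //           "a, b"  -> [a] [b]   comma between letters does split
  static void print(String text) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_41, new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    StringBuilder out = new StringBuilder(text + " ->");
    while (tok.incrementToken()) {
      out.append(" [").append(term).append(']');
    }
    tok.end();
    tok.close();
    System.out.println(out);
  }

  public static void main(String[] args) throws Exception {
    print("1,000");
    print("3.14");
    print("foo_bar");
    print("a, b");
  }
}
```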