[jira] [Commented] (LUCENE-5558) Add TruncateTokenFilter

Ahmet Arslan (JIRA) Fri, 28 Mar 2014 11:03:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951148#comment-13951148
 ]


Ahmet Arslan commented on LUCENE-5558:
--------------------------------------

The following declarations do not throw an exception, but no token survives 
them. It is unusual (and weird) that there are no surviving tokens. What do you 
think about having TestRandomChains detect an empty token stream at the end?

Should these filters validate their integer arguments?

{code:xml}
 <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="-10" 
consumeAllTokens="false" />
{code}

{code:xml}
 <filter class="solr.LengthFilterFactory" min="-5" max="-1" />
{code}

{code:xml}
 <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="-3" />
{code}
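To make the question concrete, here is a minimal sketch of the kind of check a filter constructor could run on its integer arguments. The class and method names (ArgCheck, requireAtLeast, requireLengthRange) and the chosen lower bounds are assumptions for illustration only. They are not part of Lucene or of the attached patch.

```java
// Hypothetical helper, not a Lucene class: rejects nonsensical integer
// arguments up front instead of silently producing an empty token stream.
final class ArgCheck {
    private ArgCheck() {}

    /** Returns value unchanged if value >= min, otherwise throws. */
    static int requireAtLeast(String name, int value, int min) {
        if (value < min) {
            throw new IllegalArgumentException(
                name + " must be greater than or equal to " + min + ", got: " + value);
        }
        return value;
    }

    /** Checks a length range such as the min/max pair of a length filter. */
    static void requireLengthRange(int min, int max) {
        requireAtLeast("min", min, 0);
        if (max < min) {
            throw new IllegalArgumentException(
                "max must be greater than or equal to min, got min=" + min + ", max=" + max);
        }
    }
}
```

Under these assumed bounds, maxTokenCount="-10", maxTokenPosition="-3" and min="-5" max="-1" would all fail fast with an IllegalArgumentException when the filter is constructed.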

> Add TruncateTokenFilter
> -----------------------
>
>                 Key: LUCENE-5558
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5558
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.7
>            Reporter: Ahmet Arslan
>            Priority: Minor
>              Labels: Turkish, f5
>             Fix For: 4.8
>
>         Attachments: LUCENE-5558.patch, LUCENE-5558.patch, LUCENE-5558.patch
>
>
> I am using this filter as a stemmer for the Turkish language. In many academic 
> studies (classification, retrieval) it is used and referred to as the Fixed 
> Prefix Stemmer, the Simple Truncation Method, or F5 for short.
> Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish 
> in [Information Retrieval on Turkish 
> Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf]. It is 
> the same work from which most of stopwords_tr.txt was acquired. 
> ElasticSearch has a 
> [truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html]
>  filter, but it does not respect the keyword attribute. And it has a use case 
> similar to TruncateFieldUpdateProcessorFactory.
> The main advantage of F5 stemming is that it is not affected by the meaning 
> loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works 
> well with ASCII folding. See [Effects of diacritics on Turkish information 
> retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].
> Here is the full field type I use for "diacritics-insensitive search" for 
> Turkish:
> {code:xml}
>  <fieldType name="text_tr_ascii_f5" class="solr.TextField" 
> positionIncrementGap="100">
>    <analyzer>
>      <tokenizer class="solr.StandardTokenizerFactory"/>
>      <filter class="solr.ApostropheFilterFactory"/>
>      <filter class="solr.TurkishLowerCaseFilterFactory"/>
>      <filter class="solr.ASCIIFoldingFilterFactory"/>
>      <filter class="solr.KeywordRepeatFilterFactory"/>
>      <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
>      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>  </fieldType>
> {code}
> I would like to get community opinions:
> 1) Is there any interest in this? 
> 2) Should the keyword attribute be respected? 
> 3) Package name: analysis.misc versus analysis.tr 
> 4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter
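
For readers unfamiliar with fixed-prefix stemming, the per-token behavior described in the quoted issue can be sketched in plain Java as follows. This is an illustration only, not the code in the attached patch: the class FixedPrefixSketch and its truncate method are hypothetical, and the boolean keyword parameter stands in for the keyword attribute a real token filter would consult.

```java
// Hypothetical sketch of fixed-prefix (F5-style) truncation on a single term.
// It operates on a plain String rather than on Lucene token attributes.
final class FixedPrefixSketch {
    private FixedPrefixSketch() {}

    /**
     * Keeps at most prefixLength leading characters of term.
     * Terms flagged as keywords are returned unchanged.
     */
    static String truncate(String term, int prefixLength, boolean keyword) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException(
                "prefixLength must be greater than or equal to 1, got: " + prefixLength);
        }
        if (keyword || term.length() <= prefixLength) {
            return term; // protected or already short enough
        }
        // Counts UTF-16 code units, which is adequate once terms are ASCII-folded.
        return term.substring(0, prefixLength);
    }
}
```

With prefixLength=5, truncate("kitaplar", 5, false) returns "kitap", while truncate("kitaplar", 5, true) leaves the term untouched. In the field type above, KeywordRepeatFilterFactory emits both variants and RemoveDuplicatesTokenFilterFactory drops the copy when the two are identical.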



--
This message was sent by Atlassian JIRA
(v6.2#6252)

