Hi Ziqi,

Lucene has support for sentence chunking - see SegmentingTokenizerBase,
which is extended by ThaiTokenizer and HMMChineseTokenizer.  There is an
example in that class’s tests that creates tokens out of individual sentences:
TestSegmentingTokenizerBase.WholeSentenceTokenizer.

However, it sounds like you only need to store the sentences, not search 
against them, so I don’t think you need sentence *tokenization*.

Why not simply use the JDK’s BreakIterator (or, as you say, OpenNLP) to do
the sentence splitting, and add the sentences to the doc as stored fields?
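
Here is a minimal sketch of the BreakIterator route, using only the JDK. The
class name SentenceSplitter, the split() helper and the sample text are made
up for illustration, and I'm assuming English text; BreakIterator's rules are
fairly simple, so expect it to be less accurate than a trained OpenNLP model
around abbreviations and the like.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class: splits text into sentences with the JDK's BreakIterator.
public class SentenceSplitter {

    // Returns the sentences of `text`, trimmed, skipping empty segments.
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the end offset of the next sentence,
        // or BreakIterator.DONE once the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Lucene indexes text. It can also store fields! Is that enough?";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Each returned string could then be added to the document with something like
doc.add(new StoredField("sentence", s)), next to your normally tokenized
field; the field name "sentence" is just a placeholder.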

Steve
www.lucidworks.com

> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
> 
> Thanks that is understood.
> 
> My application is a bit special in the way that I need both an indexed field 
> with standard tokenization and an unindexed but stored field of sentences. 
> Both must be present for each document.
> 
> I could possibly do with PatternTokenizer, but that is of course, less 
> accurate than e.g., wrapping OpenNLP sentence splitter in a lucene Tokenizer.
> 
> 
> 
> On 23/09/2015 16:23, Doug Turnbull wrote:
>> Sentence recognition is usually an NLP problem. Probably best handled
>> outside of Solr. For example, you probably want to train and run a sentence
>> recognition algorithm, inject a sentence delimiter, then use that delimiter
>> as the basis for tokenization.
>> 
>> More info on sentence recognition
>> http://opennlp.apache.org/documentation/manual/opennlp.html
>> 
>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
>> wrote:
>> 
>>> Hi
>>> 
>>> I need a special kind of 'token' which is a sentence, so I need a
>>> tokenizer that splits texts into sentences.
>>> 
>>> I wonder if there is already such or similar implementations?
>>> 
>>> If I have to implement it myself, I suppose I need to implement a subclass
>>> of Tokenizer. Having looked at a few existing implementations, it does not
>>> look very straightforward how to do it. A few pointers would be highly
>>> appreciated.
>>> 
>>> Many thanks
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>>> 
>> 
> 
> 
> -- 
> Ziqi Zhang
> Research Associate
> Department of Computer Science
> University of Sheffield
> 
> 
> 

