Re: tokenize into sentences/sentence splitter

Alessandro Benedetti Thu, 24 Sep 2015 02:07:09 -0700

Hmm, reading this: "*unindexed* but stored field of sentences"

"Unindexed" immediately tells me that you do not actually need a
tokeniser at all.
Just run an external sentence splitter (in your indexing application) and
store the sentences as separate values of a stored field.
Why would this not work for you?
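
A minimal sketch of that idea, assuming the JDK's java.text.BreakIterator as
the external splitter (OpenNLP's sentence detector could take its place) and
the "start,end|sentence" encoding from option 3 in the quoted message below.
The class and method names are hypothetical, made up for illustration, and
the offsets are character offsets with an exclusive end:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits text into sentences outside
// the analysis chain and encodes each one as "start,end|sentence".
final class SentenceOffsetsSketch {

    /** Returns one "start,end|sentence" string per sentence found in text. */
    static List<String> encodeSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> encoded = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            int from = start;
            int to = end;
            // Segments may carry surrounding whitespace; trim so the offsets cover only the sentence.
            while (from < to && Character.isWhitespace(text.charAt(from))) from++;
            while (to > from && Character.isWhitespace(text.charAt(to - 1))) to--;
            if (to > from) {
                encoded.add(from + "," + to + "|" + text.substring(from, to));
            }
        }
        return encoded;
    }

    /** Reads {start, end} back out of an encoded value. */
    static int[] decodeOffsets(String stored) {
        int comma = stored.indexOf(',');
        int bar = stored.indexOf('|');
        return new int[] {
            Integer.parseInt(stored.substring(0, comma)),
            Integer.parseInt(stored.substring(comma + 1, bar))
        };
    }

    /** Reads the sentence text back out of an encoded value. */
    static String decodeSentence(String stored) {
        return stored.substring(stored.indexOf('|') + 1);
    }

    public static void main(String[] args) {
        String text = "Four score and seven years ago.  Something happened.";
        for (String value : encodeSentences(text, Locale.ENGLISH)) {
            System.out.println(value);
        }
    }
}
```

Each returned string would then be added to the document as one value of a
stored-only field (in Lucene, for instance, via StoredField), next to the
normally analyzed text field. That indexing step is left out here so the
sketch depends only on the JDK.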


Cheers

2015-09-24 9:39 GMT+01:00 Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>:

> Thanks for the comprehensive explanation, I think option 3 best fits my app.
>
>
>
>
> On 23/09/2015 22:53, Steve Rowe wrote:
>
>> Unless you need to be able to search on sentences-as-terms, i.e. exact
>> sentence matching, you should try to find an alternative; otherwise your
>> term index will be unnecessarily huge.
>>
>> Three things come to mind:
>>
>> 1. A single Lucene index can host mixed document types, e.g. full
>> documents and sentences.
>>
>> 2. Nested documents, in Lucene's join module, could help, depending on
>> what you need to do.  Parent documents could correspond to original full
>> documents, and sentences could be stored fields in child documents.  The
>> sentence offsets could be separate child document fields, maybe also
>> stored-only, depending on search/sort requirements.  See
>> <http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>,
>> <http://lucene.apache.org/core/5_3_0/join/org/apache/lucene/search/join/ToParentBlockJoinQuery.html>,
>> and the tests for ToParentBlockJoinQuery for example usages:
>> <http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_3_0/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java>.
>>
>> 3. Or, more simply, just store the offsets as a prefix inline with the
>> stored sentence field values, e.g.
>>
>>      original text: Four score and seven years ago....  Something happened.
>>      words (indexed): Four, score, and, seven, years, ago, Something, happened
>>      sentences (stored): 0,31|Four score and seven years ago....
>>      sentences (stored): 33,52|Something happened.
>>
>> Steve
>> www.lucidworks.com
>>
>> On Sep 23, 2015, at 3:26 PM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>>
>>> Thanks Steve.
>>>
>>> It probably also makes sense to extract sentences and then store them.
>>> But along with each sentence I also need to store its start/end offset. I'm
>>> not sure how to do that without creating a separate index that stores each
>>> sentence as a document? Basically the field for sentence and the field for
>>> terms should be in the same index.
>>>
>>> Thanks
>>>
>>>
>>>
>>> On 23/09/2015 19:08, Steve Rowe wrote:
>>>
>>>> Hi Ziqi,
>>>>
>>>> Lucene has support for sentence chunking - see SegmentingTokenizerBase,
>>>> implemented in ThaiTokenizer and HMMChineseTokenizer.  There is an example
>>>> in that class’s tests that creates tokens out of individual sentences:
>>>> TestSegmentingTokenizerBase.WholeSentenceTokenizer.
>>>>
>>>> However, it sounds like you only need to store the sentences, not
>>>> search against them, so I don’t think you need sentence *tokenization*.
>>>>
>>>> Why not simply use the JDK’s BreakIterator (or as you say OpenNLP) to
>>>> do sentence splitting and add to the doc as stored fields?
>>>>
>>>> Steve
>>>> www.lucidworks.com
>>>>
>>>> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>>>>
>>>>> Thanks that is understood.
>>>>>
>>>>> My application is a bit special in the way that I need both an indexed
>>>>> field with standard tokenization and an unindexed but stored field of
>>>>> sentences. Both must be present for each document.
>>>>>
>>>>> I could possibly make do with PatternTokenizer, but that is of course less
>>>>> accurate than, e.g., wrapping the OpenNLP sentence splitter in a Lucene
>>>>> Tokenizer.
>>>>>
>>>>>
>>>>>
>>>>> On 23/09/2015 16:23, Doug Turnbull wrote:
>>>>>
>>>>>> Sentence recognition is usually an NLP problem. Probably best handled
>>>>>> outside of Solr. For example, you probably want to train and run a
>>>>>> sentence recognition algorithm, inject a sentence delimiter, then use
>>>>>> that delimiter as the basis for tokenization.
>>>>>>
>>>>>> More info on sentence recognition
>>>>>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I need a special kind of 'token' which is a sentence, so I need a
>>>>>>> tokenizer that splits texts into sentences.
>>>>>>>
>>>>>>> I wonder if there are already such or similar implementations?
>>>>>>>
>>>>>>> If I have to implement it myself, I suppose I need to implement a
>>>>>>> subclass of Tokenizer. Having looked at a few existing implementations,
>>>>>>> it does not look very straightforward how to do it. A few pointers would
>>>>>>> be highly appreciated.
>>>>>>>
>>>>>>> Many thanks
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>> Ziqi Zhang
>>>>> Research Associate
>>>>> Department of Computer Science
>>>>> University of Sheffield


-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
