I'm using Lucene 4.10.3. (I plan to upgrade soon but need to fix an issue
on this version today).
I switched a Lucene index from using string document ids to byte arrays.
The problem I'm having is that the system no longer finds documents by
their id. I *suspect* this is because the lucene code is
Hi
I need a special kind of 'token' which is a sentence, so I need a
tokenizer that splits texts into sentences.
I wonder if such an implementation, or something similar, already exists?
If I have to implement it myself, I suppose I need to write a
subclass of Tokenizer. Having looked at a few exist
Sentence recognition is usually an NLP problem. Probably best handled
outside of Solr. For example, you probably want to train and run a sentence
recognition algorithm, inject a sentence delimiter, then use that delimiter
as the basis for tokenization.
More info on sentence recognition
http://open
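(A rough sketch of that pipeline, assuming an OpenNLP-style sentence detector; the model file name and the delimiter character below are placeholder choices, not anything prescribed in this thread:)

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDelimiterInjector {
  public static void main(String[] args) throws Exception {
    // Load a pre-trained sentence model (file name is just an example)
    try (InputStream in = new FileInputStream("en-sent.bin")) {
      SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));

      String text = "First sentence. Second one? A third!";
      // Detect sentences, then re-join them with an unambiguous delimiter;
      // that delimiter can then drive tokenization (e.g. a PatternTokenizer
      // configured to split on it).
      String delimited = String.join("\u2029", detector.sentDetect(text));
      System.out.println(delimited);
    }
  }
}

The indexing analyzer then only needs to split on that delimiter instead of doing any NLP itself.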
You are indexing your field as a StoredField which means it's not
actually indexed (just stored), so no query (nor IW.deleteDocument)
will ever be able to find it.
Try StringField instead ... in recent versions you can pass a BytesRef
value to that.
Mike McCandless
http://blog.mikemccandless.com
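(For anyone landing here later, a minimal sketch of that suggestion, assuming Lucene 5.x where StringField accepts a BytesRef; the field name "id" and the helper method are only for illustration:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

public class BinaryIdExample {
  // 'writer' is an already-open IndexWriter; 'idBytes' is the binary id.
  static void addThenDelete(IndexWriter writer, byte[] idBytes) throws Exception {
    BytesRef id = new BytesRef(idBytes);

    Document doc = new Document();
    // StringField indexes the value as a single, untokenized term, so a
    // TermQuery or IndexWriter.deleteDocuments can find it again; a
    // StoredField is only stored, never indexed, and no query matches it.
    doc.add(new StringField("id", id, Store.YES));
    writer.addDocument(doc);

    // Later: delete (or update) by the exact same term
    writer.deleteDocuments(new Term("id", id));
  }
}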
Thanks, that is understood.
My application is a bit special in that I need both an indexed field
with standard tokenization and an unindexed but stored field of
sentences. Both must be present for each document.
I could possibly do with PatternTokenizer, but that is, of course,
less ac
I upgraded to 5.3 and fixed it as you suggested. Thank you.
On Wed, Sep 23, 2015 at 11:31 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> You are indexing your field as a StoredField which means it's not
> actually indexed (just stored), so no query (nor IW.deleteDocument)
> will ever b
Hi Ziqi,
Lucene has support for sentence chunking - see SegmentingTokenizerBase,
implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example in
that class’s tests that creates tokens out of individual sentences:
TestSegmentingTokenizerBase.WholeSentenceTokenizer.
However, it
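(A sketch of such a tokenizer, modeled on the WholeSentenceTokenizer in that test; written against the Lucene 5.x SegmentingTokenizerBase API, where 'buffer' and 'offset' are protected fields of the base class:)

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;
import org.apache.lucene.util.AttributeFactory;

// Emits one token per sentence, using the JDK BreakIterator for segmentation.
public class SentenceTokenizer extends SegmentingTokenizerBase {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  public SentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  public SentenceTokenizer(AttributeFactory factory) {
    super(factory, BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    if (!hasSentence) {
      return false; // the single sentence token was already emitted
    }
    hasSentence = false;
    clearAttributes();
    // 'buffer' holds the current chunk of input; 'offset' is its start offset
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                        correctOffset(offset + sentenceEnd));
    return true;
  }
}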
Thanks Steve.
It probably also makes sense to extract sentences and then store them.
But along with each sentence I also need to store its start/end offsets.
I'm not sure how to do that without creating a separate index that
stores each sentence as a document? Basically the field for sentence a
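(One possibility, purely as a sketch and not necessarily what gets recommended below: keep each sentence as a stored-only value on the same document, with its character offsets packed into the stored string. The field name and the "start|end|text" encoding are arbitrary; this assumes an OpenNLP-style detector whose sentPosDetect returns character spans:)

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.util.Span;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

public class StoredSentences {
  // Adds one stored (not indexed) value per sentence, e.g. "12|47|Second one?"
  static void addStoredSentences(Document doc, SentenceDetectorME detector, String text) {
    for (Span span : detector.sentPosDetect(text)) {
      String sentence = text.substring(span.getStart(), span.getEnd());
      doc.add(new StoredField("sentence",
          span.getStart() + "|" + span.getEnd() + "|" + sentence));
    }
  }
}

At search time the stored values come back in the order they were added, so the offsets can be parsed back out without maintaining a second index.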
Further to this problem, I have created a custom tokenizer, but I cannot
get it loaded properly by Solr.
The error stacktrace:
Exception in thread "main" org.apache.solr.common.SolrException:
SolrCore 'myproject' is not available due to init failure: Could not
load c
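(Hard to say more without the full stack trace, but one common cause is pointing the schema's <tokenizer class="..."/> at the Tokenizer itself instead of at a TokenizerFactory, or not having the jar on the core's classpath. A sketch of a factory for the sentence tokenizer above; the class names are only for illustration:)

import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

// Solr instantiates the factory named in schema.xml; the factory (and the
// Tokenizer it creates) must be on the core's classpath, e.g. a jar in the
// core's lib/ directory or added via a <lib .../> directive in solrconfig.xml.
public class SentenceTokenizerFactory extends TokenizerFactory {

  public SentenceTokenizerFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    // SentenceTokenizer is the SegmentingTokenizerBase subclass sketched earlier
    return new SentenceTokenizer(factory);
  }
}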
Unless you need to be able to search on sentences-as-terms, i.e. exact sentence
matching, you should try to find an alternative; otherwise your term index will
be unnecessarily huge.
Three things come to mind:
1. A single Lucene index can host mixed document types, e.g. full documents and
sentenc
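(A sketch of what option 1 could look like: full documents and sentence-level documents in the same index, distinguished by a discriminator field. The field names here ("doctype", "parent_id", "start", "end") are made up for illustration:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class MixedDocTypes {

  // A "full" document: the whole text, analyzed normally.
  static void addFullDoc(IndexWriter writer, String id, String body) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("doctype", "full", Store.YES));
    doc.add(new StringField("id", id, Store.YES));
    doc.add(new TextField("body", body, Store.YES));
    writer.addDocument(doc);
  }

  // A "sentence" document: one sentence plus its offsets, linked to its parent.
  static void addSentenceDoc(IndexWriter writer, String parentId, String sentence,
                             int start, int end) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("doctype", "sentence", Store.YES));
    doc.add(new StringField("parent_id", parentId, Store.YES));
    doc.add(new StoredField("start", start));
    doc.add(new StoredField("end", end));
    doc.add(new TextField("sentence", sentence, Store.YES));
    writer.addDocument(doc);
  }
}

Queries can then filter with a TermQuery on "doctype" to pick which kind of document they want.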