Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
-----Original Message-----
From: Ziqi Zhang [mailto:ziqi.zh...@sheffield.ac.uk]
Sent: Saturday, October 03, 2015 5:01 PM
To: java-user@lucene.apache.org
Subject: lucene deliberately removes \r (windows car
so it's applied on system startup.
This depends on your Linux distribution, so we cannot give specific help on this.
I would also recommend reviewing my blog post, whose URL is given in the
exception message!
Uwe
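For reference, the kernel setting behind "map failed" errors on Linux is usually the `vm.max_map_count` sysctl, as discussed in the blog post Uwe mentions. A hedged sketch; the value and file path are illustrative and vary by distribution:

```sh
# Raise the mmap limit for the running kernel (example value)
sysctl -w vm.max_map_count=262144

# To have it applied on system startup, persist it
# (the exact file/mechanism depends on the distribution)
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
```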
On 1 October 2015 at 21:25:30 MESZ, Ziqi Zhang wrote:
Hi
I am trying to pin-point a mismatch between the offsets produced by the
Lucene indexing process and the results I get when I use those offsets to
substring from the original document content.
I have tried to debug as far as I can go, but I lose track of Lucene at
line 298 of DefaultIndexingChain (Lucene 5.3.0):
Hi,
I have a problem which I think is the same as that described here:
http://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed
However, the solution does not apply in this case, so I am providing more
details and asking again.
The index is created using Solr 5.3
Hi
Is there a way to remove just the leading and trailing stopwords from a
token n-gram?
Currently I have the following combination, which removes any n-gram that
contains a stopword:
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
sentences (stored): 33,52|Something happened.
Steve
www.lucidworks.com
On Sep 23, 2015, at 3:26 PM, Ziqi Zhang wrote:
Thanks Steve.
It probably also makes sense to extract sentences and then store them. But
along with each sentence I also need to store its start/end offset. I
se.WholeSentenceTokenizer.
However, it sounds like you only need to store the sentences, not search
against them, so I don’t think you need sentence *tokenization*.
Why not simply use the JDK's BreakIterator (or, as you say, OpenNLP) to do
the sentence splitting and add the sentences to the doc as stored fields?
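The BreakIterator approach is a few lines of plain JDK code. A minimal sketch that splits text into sentences and records each sentence's start/end character offsets, encoded in the `start,end|sentence` style of the stored-field example earlier in the thread (the class name is mine, not from the thread):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split text into sentences, recording each sentence's start/end
    // character offsets in the original text as "start,end|sentence".
    public static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(start + "," + end + "|" + sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : split("Something happened. Then something else.")) {
            System.out.println(s);
        }
    }
}
```

Each encoded string can then be added to the Lucene document as a stored, unindexed field. Note that BreakIterator's end boundary includes trailing whitespace before the next sentence.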
Steve
www.lucidworks.com
On Sep 23, 2015, at 11:39 AM, Ziqi Zhang wrote:
Thanks, that is understood.
My application is a bit special in that I need both an indexed field
with standard tokenization and an unindexed but s
a sentence delimiter, then use that delimiter
as the basis for tokenization.
More info on sentence recognition
http://opennlp.apache.org/documentation/manual/opennlp.html
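One way to realize the delimiter idea in Solr is to preprocess the text so that each sentence ends with a character that never occurs naturally (U+2029 PARAGRAPH SEPARATOR is a common choice) and then tokenize on that character. A hypothetical field-type sketch; the field-type name and delimiter are assumptions, not from the thread:

```xml
<fieldType name="sentences" class="solr.TextField">
  <analyzer>
    <!-- emits one token per sentence, splitting on the injected delimiter -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\u2029"/>
  </analyzer>
</fieldType>
```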
On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang wrote:
Hi
I need a special kind of 'token' which is a sentence, so I need a
tokenizer that splits texts into sentences.
I wonder whether such an implementation, or something similar, already exists?
If I have to implement it myself, I suppose I need to implement a
subclass of Tokenizer. Having looked at a few exist
Hi
Given a document in a Lucene index, I would like to get a list of the terms
in that document and their offsets. I suppose starting with
IndexReader.getTermVector can get me going. I have some code
below (Lucene 5.3) about which I have some questions:
you need to enable term vectors during indexing. The
pattern for using the terms enum can be looked up in various places in the
Lucene source code. It's a very expert-level API, but it is the way to go here.
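A sketch of that pattern against the Lucene 5.x API, assuming the field was indexed with storeTermVectors, storeTermVectorPositions and storeTermVectorOffsets enabled (the `reader`, `docId` and `"content"` field name are placeholders, not from the thread):

```java
// Iterate a document's term vector and print each term with its offsets.
Terms vector = reader.getTermVector(docId, "content");
TermsEnum termsEnum = vector.iterator();
BytesRef term;
PostingsEnum postings = null;
while ((term = termsEnum.next()) != null) {
    postings = termsEnum.postings(postings, PostingsEnum.OFFSETS);
    postings.nextDoc();              // a term vector behaves like a one-document index
    int freq = postings.freq();
    for (int i = 0; i < freq; i++) {
        postings.nextPosition();     // must be called before reading offsets
        System.out.println(term.utf8ToString()
                + " [" + postings.startOffset() + "," + postings.endOffset() + ")");
    }
}
```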
Uwe
On 20 September 2015 at 15:35:40 MESZ, Ziqi Zhang wrote:
Hi
Is it possible to get a list of terms within a document, and also the TF of
each of those terms *in that document only*? (Lucene 5.3)
IndexReader has a method "Terms getTermVector(int docID, String field)",
which gives me a "Terms" object, on which I can get a TermsEnum. But I
do not know whe
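Since a term vector only covers that single document, the enum's `totalTermFreq()` is already the within-document frequency. A brief sketch (Lucene 5.x; `reader`, `docId` and the `"content"` field name are placeholders):

```java
// Per-document term frequencies from a term vector: totalTermFreq() counts
// occurrences within this one document, because the vector contains only it.
Terms vector = reader.getTermVector(docId, "content");
TermsEnum te = vector.iterator();
BytesRef term;
while ((term = te.next()) != null) {
    System.out.println(term.utf8ToString() + " -> tf=" + te.totalTermFreq());
}
```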
Hi
I am using the TermsComponent in my Solr config like this to handle
queries about terms in the index:
--
<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
---
For example, I want to fetch any *terms* containing "surface defects".
Using Solr I can d
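With the TermsComponent, substring matching over terms is typically done via the `terms.regex` parameter. A hedged example request; the host, core name and field name are assumptions, and matching a two-word string only works if the field actually holds multi-word terms (e.g. shingles):

```
http://localhost:8983/solr/mycore/terms?terms.fl=content&terms.regex=.*surface%20defects.*&terms.limit=50
```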