phi.de
eMail: u...@thetaphi.de
> -Original Message-----
> From: Scott Smith [mailto:ssm...@mainstreamdata.com]
> Sent: Thursday, December 05, 2013 9:36 PM
> To: java-user@lucene.apache.org
> Subject: Analyzers aren't reusable?? (lucene 4.2.1)
>
> I wrote the following
I wrote the following to demonstrate what for me was surprising behavior (this
is Lucene 4.2.1). If you want to run this yourself, you should be able to
since there are no references to anything other than standard lucene and java
libraries. Basically, this is an analyzer that makes everything
Never mind. I figured it out. Thanks anyway.
-Original Message-
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Wednesday, November 27, 2013 9:27 AM
To: java-user@lucene.apache.org
Subject: Highlighting phrases
I'm doing some highlighting with the following code fra
I'm doing some highlighting with the following code fragment:
formatter = new SimpleHTMLFormatter(,
);
Scorer score = new QueryScorer(myQuery);
ht = new Highlighter(formatter, score);
ht.setTextFragmenter(new NullFragmenter());
ounds like you either need to have a custom analyzer or a field-aware
analyzer.
-- Jack Krupansky
-Original Message-----
From: Scott Smith
Sent: Tuesday, September 17, 2013 4:26 PM
To: java-user@lucene.apache.org
Subject: Can you escape characters you don't want the analyzer to modify
S
Suppose I have a string like "ab@cd%d". My analyzer will turn this into "ab cd
d". Can I pass it "ab\@cd\%d" and force it to treat it as a single word? I
want to use the Query parser, but I don't want it messing with fields that have
not been analyzed.
I want to be sure I understand this correctly. Suppose I have a search that
I'm going to run through the query parser that looks like:
body:"some phrase" AND keyword:"my-keyword"
clearly "body" and "keyword" are field names. However, the additional
information is that the "body" field is anal
g the whole term in quotes. Otherwise the slash (even embedded in the
middle of a term!) indicates the start of a regex query term.
-- Jack Krupansky
-Original Message-
From: Scott Smith
Sent: Sunday, May 19, 2013 2:50 PM
To: java-user@lucene.apache.org
Subject: classic.QueryParser - bug o
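A minimal sketch of two ways to keep a term such as 20110920/EXPIRED from being read as a regex query: wrapping the whole term in quotes (as suggested above), or backslash-escaping the slash (an assumption based on the query parser's usual escape syntax). It uses only the Java standard library; the class and method names are hypothetical and are not part of Lucene. Lucene's classic QueryParser also ships a static escape(String) helper, which is probably the better choice in real code.

```java
final class SlashTermSketch {

    // Hypothetical helper: put a backslash before every '/' so the
    // slash is treated as a literal character, not a regex delimiter.
    static String escapeSlashes(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '/') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Hypothetical helper: wrap the term in double quotes, escaping any
    // backslashes and embedded quotes first.
    static String quoteTerm(String term) {
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        String raw = "20110920/EXPIRED";
        System.out.println(escapeSlashes(raw));
        System.out.println(quoteTerm(raw));
    }
}
```

Either string would then be handed to the query parser in place of the raw term. Note that on an analyzed field a quoted term may be parsed as a phrase, so the two forms are not always equivalent.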
I just upgraded from lucene 4.1 to 4.2.1. We believe we are seeing some
different behavior.
I'm using org.apache.lucene.queryparser.classic.QueryParser. If I pass the
string "20110920/EXPIRED" (w/o quotes) to the parser, I get:
org.apache.lucene.queryparser.classic.ParseException: Cannot pars
hy on earth do you set: lbsm.setMaxMergeDocs(10);
if you have 10 docs in a segment you don't want to merge anymore? I don't think
you should set this at all.
simon
On Wed, Mar 20, 2013 at 10:48 PM, Scott Smith wrote:
> First, I decided I wasn't comfortable doing closes on
tRAMBufferSizeMB(50.0);
Any help in figuring out what is causing this problem would be appreciated. I
do now have an offline system that I can play with so I can do some intrusive
things if need be.
Scott
-Original Message-
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent:
Subject: RE: Lucene slow performance
Please forceMerge only one time not every time (only to clean up your index)!
If you are doing a reindex already, just fix your close logic as discussed
before.
Scott Smith wrote:
>Unfortunately, this is a production system which I can't touch (th
nauer [mailto:simon.willna...@gmail.com]
Sent: Friday, March 15, 2013 5:08 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene slow performance
On Sat, Mar 16, 2013 at 12:02 AM, Scott Smith wrote:
> " Do you always close IndexWriter after adding few documents and when
> closing, d
March 16, 2013 12:08 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene slow performance
>
> On Sat, Mar 16, 2013 at 12:02 AM, Scott Smith
> wrote:
> > " Do you always close IndexWriter after adding few documents and
> > when
> closing, disable "
l the time with cancelling
all merges)?
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Scott Smith [mailto:ssm...@mainstreamdata.com]
> Sent: Friday, March 15, 2013 11:15 PM
> To: java-
a
custom merge policy or something like this, any special IndexWriter settings?
On Fri, Mar 15, 2013 at 11:15 PM, Scott Smith wrote:
> We have a system that is using lucene and the searches are very slow. The
> number of documents is fairly small (less than 30,000) and each document is
n has changed since 1.4, but does it not
merge all of the various files into a few files?
-Original Message-
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Friday, March 15, 2013 4:15 PM
To: java-user@lucene.apache.org
Subject: Lucene slow performance
We have a system th
We have a system that is using lucene and the searches are very slow. The
number of documents is fairly small (less than 30,000) and each document is
typically only 2 to 10 kilo-characters. Yet, searches are taking 15-16 seconds.
One of the things I noticed was that the index directory has sev
Thanks for the suggestions I think Erick is correct as well. I'll let the
customer decide.
Here's an updated list. Fyi--the minStem was the English Minimal Stemmer--I
changed the label. Interesting to see where the minimal stemmer and porter
agree (and KStemmer doesn't). You may also find t
d how some common words are stemmed.
-- Jack Krupansky
-Original Message-
From: Scott Smith
Sent: Wednesday, November 14, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Which stemmer?
Does anyone have any experience with the stemmers? I know that Porter is what
"everyone"
Thanks
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, November 14, 2012 12:17 PM
To: java-user@lucene.apache.org
Subject: Re: CJKWidthFilter vs ICUFoldingFilter
On Wed, Nov 14, 2012 at 9:47 AM, Scott Smith wrote:
> Reading the documentation for these
Does anyone have any experience with the stemmers? I know that Porter is what
"everyone" uses. Am I better off with KStemFilter (better performance) or ??
Does anyone understand the differences between the various stemmers and how to
choose one over another?
Reading the documentation for these two filters seems to imply that
CJKWidthFilter is a subset of ICUFoldingFilter. Is that true? I'm basically
using the CjkAnalyzer (from Lucene 4.0) but adding ICUFoldingFilter because I
need umlauts and accent characters removed from any German, French, etc.
ccandless.com]
Sent: Tuesday, November 06, 2012 5:32 AM
To: java-user@lucene.apache.org
Subject: Re: Near Real Time for multiple applications
On Mon, Nov 5, 2012 at 6:33 PM, Scott Smith wrote:
> I've been reading about NRT thinking it might be good to integrate it into my
> code. However,
tags being properly nested.
Cheers
Scott
-----Original Message-
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Thursday, November 01, 2012 7:16 PM
To: Michael Sokolov; java-user@lucene.apache.org
Subject: RE: Highlighting html pages
I was trying to play with this. Am I correct in
I've been reading about NRT thinking it might be good to integrate it into my
code. However, I have a question.
Suppose that the index writer and the index reader run in totally different
JVMs (i.e., they are different applications and only communicate via the disk).
Am I correct in thinking
actory.com]
Sent: Tuesday, October 23, 2012 9:04 PM
To: java-user@lucene.apache.org
Cc: Scott Smith
Subject: Re: Highlighting html pages
If you use HTMLStripCharFilter, it extracts the text only, leaving tags out,
and remembering the word positions so that highlighting works properly. Should
do ex
I was doing some tokenizer/filter analysis attempting to fix a bug I have in
highlighting under 4.0. I was running the displayTokensWithFullDetails code
from LIA2. I would get an exception with a bad index value of -1.
I fixed the problem by doing a reset() immediately after creating my
Token
I'm migrating code from Lucene 3.5 to 4.0. I have the following code which is
supposed to highlight text. I get the exception InvalidTokenOffsetsException.
I have no idea what that means. I am using a custom analyzer which seems to
work for searching/indexing, so I assume it will work here (
-user@lucene.apache.org
Subject: Re: Norms and Term Vectors in Lucene 4.0
hey scott,
On Mon, Oct 29, 2012 at 11:56 PM, Scott Smith wrote:
> Converting some code to lucene 4.0, it appears that we can no longer set
> whether we want to store norms or termvectors using the "sug
Converting some code to lucene 4.0, it appears that we can no longer set
whether we want to store norms or termvectors using the "sugared" Field classes
(e.g., StringField() and TextField). I gather the defaults are to store norms
and to not store termvectors?
If I don't want norms on a field,
.html#openIfChanged?
See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged(org.apache.lucene.index.DirectoryReader)
-- Jack Krupansky
-Original Message-
From: Scott Smith
Sent: Friday, October 26, 2012 7:54 PM
To: java-user@lucene.apache.org
Su
2 01:47, Mossaab Bagdouri wrote:
> Lucene document IDs are not stable. You could add a field with an ID
> that you maintain. Your query would then be just a TermQuery on the ID.
>
> Regards,
> Mossaab
>
>
> 2012/10/26 Scott Smith
>
>> I'm currently converting
cument IDs are not stable. You could add a field with an ID
| > that you maintain. Your query would then be just a TermQuery on the
| > ID.
| >
| > Regards,
| > Mossaab
| >
| >
| > 2012/10/26 Scott Smith
| >
| >> I'm currently converting some lucene code to
How do I determine if the index has been modified in 4.0? The ifchanged() and
isChanged() appear to have been removed.
I'm currently converting some lucene code to 4.0. It appears that you are no
longer allowed to delete a document by its ID. Is that correct? Is my only
option to figure some kind of query (which obviously isn't based on ID) and do
the delete from there?
I need to take an html page that I retrieve from my lucene search and
highlight all of the terms that are part of the search. I need to skip over
any html tags since I don't want any words in tags which happen to match the
search to be highlighted.
Note that I don't want sections of the docum
armed up
after a commit and the never ending full GCs.
Greets Ralf
-Original Message-
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Monday, July 16, 2012 10:29 PM
To: java-user@lucene.apache.org
Subject: Lucene reorganizing indexes
We have an application that has
We have an application that has to do "real time" indexing of a number of
documents. What it does is wake up about every 20 seconds and updates the
index with any changes that have been queued since the last time it ran. This
involves adding and deleting several hundred documents. This is all
I really need this on Solr, but thought I would start here as I suspect that,
if it's possible, it's some kind of custom relevancy ranking that would need to
be done in lucene and then used in SOLR. I will simplify the actual problem
somewhat, but I think it will have the gist of what I want to
OK. Thanks
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, September 26, 2011 12:15 PM
To: java-user@lucene.apache.org
Subject: Re: MoreLikeThis Interface changes
On Mon, Sep 26, 2011 at 2:06 PM, Scott Smith wrote:
> "is" is the input stream
riginal Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, September 21, 2011 6:59 PM
To: java-user@lucene.apache.org
Subject: Re: MoreLikeThis Interface changes
On Wed, Sep 21, 2011 at 5:17 PM, Scott Smith wrote:
> I'm updating my lucene code from 3.0 to 3.4. There
Understand. Thanks for the information.
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, September 21, 2011 6:59 PM
To: java-user@lucene.apache.org
Subject: Re: MoreLikeThis Interface changes
On Wed, Sep 21, 2011 at 5:17 PM, Scott Smith wrote:
>
I'm updating my lucene code from 3.0 to 3.4. There's a change in the MLT
interface I'm confused about. I used the MLT.like(InputStream) method. It now
appears I should change to the MLT.like(InputStreamReader, fieldname) method.
Easy enough to create an InputStreamReader from an InputStream.
One thing to note is that the Stanford POS Tagger is licensed using GPL v2. A
commercial license is available, but it doesn't appear to be free ($3k min if I
read correctly).
I wonder what it would take to make this available using OpenNLP which has a
friendlier license.
-Original Message
mail.com]
Sent: Friday, September 17, 2010 1:03 AM
To: java-user@lucene.apache.org
Subject: Re: QueryParser in 3.x
On Fri, Sep 17, 2010 at 1:06 AM, Scott Smith wrote:
> I recently upgraded to Lucene 3.0 and am seeing some new behavior that I
> don't understand. Perhaps someone can e
I recently upgraded to Lucene 3.0 and am seeing some new behavior that I don't
understand. Perhaps someone can explain why.
I have a custom analyzer. Part of the analyzer uses the AsciiFoldingFilter.
If I run a word with an umlaut through that analyzer using the AnalyzerDemo
code in LIA2,
Thanks for looking at this Uwe. I'll check my code again, but I tried changing
it several times and it did seem to make a difference.
Scott
-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Saturday, March 06, 2010 3:11 AM
To: java-user@lucene.apache.org
Subject: R
I've been updating from 2.4.2 to 3.0.1. I had a number of issues (The
Version object in the analyzers was an "interesting" addition. I guess I
don't understand the use case for them. I understand what it says; I was
just surprised and it caused me some problems since I create analyzers
with reflect
I've been looking at the changes I have to make in my code to go from
2.4.1 to 2.9. One of the features I have is to highlight query hits in
documents which meet the search criteria. If the query has a phrase,
then I need to highlight the phrase, but not isolated words from the
phrase which also
> wrote:
> On Fri, Jun 19, 2009 at 2:40 PM, Scott
Smith
> wrote:
> > In my environment, one of the concerns is that new documents are
> > constantly being added (and some documents may be deleted). This
means
> > that when a user does a search and pages through results, it
As I read about Filters, it seems to me that a filter is preferred for
any portion of the query string where you are setting the boost to 0
(meaning you don't want it to contribute to the relevancy score).
But, relevancy is only interesting if you are displaying the documents
in relevancy ord
In my environment, one of the concerns is that new documents are
constantly being added (and some documents may be deleted). This means
that when a user does a search and pages through results, it is possible
that there are new items coming in which affect the search, thus changing
where items are
> -Original Message-
> From: Scott Smith [mailto:ssm...@mainstreamdata.com]
> Sent: Wednesday, June 17, 2009 2:15 AM
> To: java-user@lucene.apache.org
> Subject: Queries and Filters
>
> The last few versions of lucene have deprecated several of the
> interfaces we
Clarification: Obviously, I should have said "June 11" when I talked of a newer
date.
____
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Tue 6/16/2009 5:41 PM
To: java-user@lucene.apache.org
Subject: Getting results for a specific date
M
The last few versions of lucene have deprecated several of the
interfaces we were using and this is necessitating a fairly major
upgrade of our code (which hasn't had much done to it for several
years). I'm not complaining; the changes are probably necessary.
In reading LIA2, I've learned abou
().getImplementationVersion();
cheers,
João
On Tue, Jun 16, 2009 at 11:36 PM, Scott Smith wrote:
> Is there any way to programmatically determine the version of lucene
> being loaded?
>
>
>
>
--
Regards,
João Carlos Galaio da Silva
--
Mostly, our users want to see search results in reverse date order
(newer hits first). I know how to do that with a Sort object and it
works fine.
However, sometimes our users want to do a search and get results in date
order starting at a certain date. Say for example, they want to start
the
Is there any way to programmatically determine the version of lucene
being loaded?
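One common approach, sketched here with the standard library only, is to read the Implementation-Version that a jar's manifest exposes through java.lang.Package. The class and method names below are hypothetical. For Lucene you would pass a class loaded from the Lucene jar (for example IndexWriter.class), assuming that jar's manifest carries the attribute; when it does not, the call returns null and the sketch falls back to "unknown".

```java
final class LibraryVersionSketch {

    // Hypothetical helper: returns the Implementation-Version recorded for
    // the package of the given class, or "unknown" when none is available.
    static String versionOf(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    public static void main(String[] args) {
        // String.class stands in for a library class here.
        System.out.println(versionOf(String.class));
    }
}
```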
I'm optimizing a database and getting the error:
maxClauseCount is set to 1024
I understand what that means coming out of the query parser, but what
does it mean coming from the optimizer?
Scott
e you could make one filter on A.
>>
>> You could also consider a custom scorer that, added 1,000,000 to
every
>> category A document.
>>
>> How much were you boosting by? What happens if you boost by a very
large
>> factor?
>> As in ridiculously large?
I'm interested in comments on the following problem.
I have a set of documents. They fall into 3 categories. Call these
categories A, B, and C. Each document has an indexed, non-tokenized
field called "category" which contains A, B, or C (they are mutually
exclusive categories).
All
ively.
Steve
On 07/18/2008 at 5:03 PM, Scott Smith wrote:
> org.apache.lucene.analysis.cjk.CJKTokenizer is in the
> "contrib" portion of lucene, so I'm not sure if this is the
> right place to mention this or not. I was doing some
> detailed analysis of how this to
org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of
lucene, so I'm not sure if this is the right place to mention this or not. I
was doing some detailed analysis of how this tokenizer worked and noticed the
following behavior (which I would classify as a bug).
If you
a customer, that will be 1 beer...
On Sun, 2008-04-20 at 17:12 -0600, Scott Smith wrote:
> I've written some code to highlight items from a search using the
standard Highlighter class, QueryScorer, and NullFragmenter. Everything
works fine except when we do phrases. If I search for "
I've written some code to highlight items from a search using the standard
Highlighter class, QueryScorer, and NullFragmenter. Everything works fine
except when we do phrases. If I search for "fred smith" (with the quotes), it
highlights any instances of "fred smith" just as expected. However
Since what I'm dealing with is well-formed html, I wonder if I could modify the
tokenizer to skip the html elements and then use the NullFragmenter. I can
probably isolate the html text. Sounds like I have a plan or at least
something to try.
Thanks
From: M
. What kind of documents are you indexing?
Matthijs
Scott Smith wrote:
> I've been looking at the highlighter examples. All of them seem to deal with
> fragments. I need to highlight an entire document as it is displayed (i.e.,
> highlight all of the keywords in it). Can someo
I've been looking at the highlighter examples. All of them seem to deal with
fragments. I need to highlight an entire document as it is displayed (i.e.,
highlight all of the keywords in it). Can someone point me to some examples of
this or does the highlighter code not do this?
Thanks
Sco
ument.
I guess that all makes sense, it just means I have to be careful as to
which queries I set the category boost to zero and which I don't.
-Original Message-----
From: Scott Smith [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 12, 2006 3:31 PM
To: java-user@lucene.apache.org
Subj
I've implemented the zero boost solution and it seems to be doing what I
want. Thanks to everyone who had suggestions.
-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Monday, December 11, 2006 11:45 AM
To: java-user@lucene.apache.org
Subject: Re: de-boosting fiel
int .
But again, don't be surprised if one of the more expert folks comes up with
a *much* better idea
Best
Erick
On 12/8/06, Scott Smith <[EMAIL PROTECTED]> wrote:
>
> I have a collection of documents for which I've always returned the
> results sorted on the date
I have a collection of documents for which I've always returned the
results sorted on the date/time of the document (using a sort object in
the search method on my Searcher). It works great.
Suddenly, I have a requirement to return the documents in relevancy
order. So, that's easy (I thought)
Suppose I want to index 500,000 documents (average document size is
4 KB). Let's assume I create a single index and that the index is
static (I'm not going to add any new documents to it). I would guess
the index would be around 2GB.
Now, I do searches against this on a somewhat beefy mach
Interesting and thanks for the answer. I guess I won't write code to
control the order clauses get added--one less thing to do :-)
-Original Message-
From: Doron Cohen [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 20, 2006 6:47 PM
To: java-user@lucene.apache.org
Subject: Re: Performanc
I was reading a book on SQL query tuning. The gist of it was that the
way to get the best performance (fastest execution) out of a SQL select
statement was to "create" execution plans where the most selective term
in the "where" clause is used first, the next most selective term is
used next, etc.
Thanks to everyone who commented. Clearly, I have a lot to think about,
but thanks for the help.
Scott
-Original Message-
From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED]
Sent: Friday, July 07, 2006 2:53 PM
To: java-user@lucene.apache.org
Subject: RE: Managing a large archival (and co
I've been asked to do a project which provides full-text search for a
large database of articles. The expectation is that most of the
articles are fairly small (<2k bytes). There will be an initial
population of around 400,000 articles. There will then be approximately
2000 new articles added ea
I'm building an application which has to provide "real-time" searching
of emails as they come in. I have a number of search strings that I
need to apply against each email as it comes in and then do something
with the email based on which search string(s) get a hit.
My initial thought was to
pressed,indexed>
DOC1: Document stored/uncompressed,indexed>
HITS: 2
DOC0: Document stored/uncompressed,indexed>
DOC1: Document stored/uncompressed,indexed>
See also:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexRead
er.html#isDeleted(int)
Otis
--- Scott Smith <[EM
Suppose I do a search and get a hit list. Before I access the hit list,
my delete routine (running in another thread) comes along and deletes
some documents. What happens if I now try to access documents that have
been deleted?
Scott
I needed to return my hits list in date/time order (instead of
relevancy). So, I implemented a class that converted dates to an int
and stored the integer as a field in my index. I passed a Sort object
to the IndexSearcher (indicating that the sort field was convertible to
int) to get things back
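The date-to-int conversion described above might look something like the following sketch. The original class is not shown, so the yyyyMMdd encoding, the class name, and the method names are all assumptions; the sketch also covers only the date part (not the time of day) and assumes four-digit positive years. The property that matters is that integer order matches chronological order, so a numeric sort on the stored field yields date order.

```java
import java.time.LocalDate;

final class DateAsIntSketch {

    // Assumed encoding: 2009-06-11 becomes 20090611, so comparing the
    // ints gives the same result as comparing the dates.
    static int encode(LocalDate date) {
        return date.getYear() * 10000
                + date.getMonthValue() * 100
                + date.getDayOfMonth();
    }

    // Inverse of encode(), useful when displaying a stored value.
    static LocalDate decode(int value) {
        return LocalDate.of(value / 10000, (value / 100) % 100, value % 100);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2009, 6, 11);
        int stored = encode(date);
        System.out.println(stored + " -> " + decode(stored));
    }
}
```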
I have the need to create an index which will potentially have a
million+ documents. I know Lucene can accomplish this. However, the
other requirement is that I need to be continually updating it during
the day (adding 1-30 documents/minute). I guess I had thought that I
might try to have an ac