> a.b.c.d.e.f.g.h is not broken apart the way the snowball demo
> indicates it should be.
I am not sure about the "should" here - the way I see it, this
is just how the demo works: Snowball stemmers operate on words,
so the demo first breaks the input text into words and only
then applies stemming.
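The point above — that the demo tokenizes before it stems — can be illustrated without Lucene at all. This is a hypothetical plain-Java sketch (the regex and class name are mine, not from the demo): a letter-based split turns "a.b.c.d.e.f.g.h" into eight one-letter tokens, and only those tokens would ever reach a Snowball stemmer.

```java
// Hypothetical sketch: mimic a letter-based tokenizer's behaviour
// by splitting on any run of non-letter characters.
public class TokenizeFirst {
    public static String[] tokenize(String text) {
        // Each '.' is a non-letter, so it separates tokens.
        return text.split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("a.b.c.d.e.f.g.h");
        System.out.println(tokens.length); // 8
        System.out.println(tokens[0]);     // a
    }
}
```

So the dots disappear at the tokenization step, before stemming is even involved.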
Thanks. If the extra memory allocated is native memory I don't think
jconsole includes it in "non-heap" as it doesn't show this as
increasing, and jmap/jhat just dump/analyse the heap. Do you know of an
application that can report native memory usage?
Thanks,
Steve
Doron Cohen wrote:
Stephen,
Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01:
> One interesting thing is that although the memory allocated as
> reported by the processes tab of Windows Task Manager goes up and up,
> and the JVM eventually crashes with an OutOfMemory error, the total size
> of heap + non-heap as
Hi Otis,
Thanks very much for your reply.
I've removed the LuceneIndexAccessor code, and still have the same
problem, so that at least rules out LuceneIndexAccessor as the source.
maxBufferedDocs is just set to the default, which I believe is 10.
I've tried jconsole, + jmap/jhat for looking
On 16-May-07, at 11:00 PM, Doron Cohen wrote:
If you enter a.b.c.d.e.f.g.h to that demo you'll see that
the demo simply breaks the input text on '.' - that has
nothing to do with filenames.
That is not what I am seeing from my testing:
a.b.c.d.e.f.g.h is not broken apart the way the snowball demo indicates it should be.
I found a similar recommendation about the disc access and reading in order
in the following message and implemented this in my code:
http://www.gossamer-threads.com/lists/lucene/general/28268#28268
Since I am dealing with multiple index directories I sorted the document
references by index number.
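The sorting step described above can be sketched in plain Java. The `DocRef` type and its fields are hypothetical names of mine; the idea from the message is simply to order document references by index, then by document number, so each index directory is read sequentially rather than with random seeks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortRefs {
    // Hypothetical holder for "which index, which doc" references.
    static final class DocRef {
        final int indexNo; // which index directory
        final int docId;   // document number within that index
        DocRef(int indexNo, int docId) { this.indexNo = indexNo; this.docId = docId; }
    }

    // Sort so that reads proceed in order within each index.
    public static void sortForSequentialReads(List<DocRef> refs) {
        refs.sort(Comparator.<DocRef>comparingInt(r -> r.indexNo)
                            .thenComparingInt(r -> r.docId));
    }

    public static void main(String[] args) {
        List<DocRef> refs = new ArrayList<>(List.of(
            new DocRef(1, 40), new DocRef(0, 7), new DocRef(1, 3)));
        sortForSequentialReads(refs);
        // After sorting: (0,7), (1,3), (1,40)
        System.out.println(refs.get(0).indexNo + ":" + refs.get(0).docId);
    }
}
```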
Thank you, Erick, this is very useful!
Have you ever taken a look at Google Suggest[1]? It's very fast, and the
results are impressive. I think your suggestion will go a long way to
fixing my problem, but there's probably still quite a gap between this
approach and the kind of results that Google
Hi Steve,
You said the OOM happens only when you are indexing. You don't need
LuceneIndexAccessor for that, so get rid of it to eliminate one suspect that is not
part of Lucene core. What is your maxBufferedDocs set to? And since you are
using JVM 1.6, check out jmap, jconsole & friends, they'll
- Original Message
From: Paul Elschot <[EMAIL PROTECTED]>
On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter with
> Lucene document reads.
>
> We have amongst other fields one field (default) storing all searchable
Scoring cannot be turned off, currently. I once thought it was possible to skip
scoring with the patch in the LUCENE-584 JIRA issue, but I was wrong.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message -
Oops. I do indeed have omitNorms turned on. I will re-read the
documentation on it and look at turning it off.
Sorry for the bother. :/
On 5/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: Terminator 2
: Terminator 2: Judgment Day
:
: And I score them against the query +title:(Terminator 2)
: Would there be some method or combination of methods in Similarity
: that I could easily override to allow me to penalize the second item
: because it had "unused terms"?
that's what the De
If I have two items in an index:
Terminator 2
Terminator 2: Judgment Day
And I score them against the query +title:(Terminator 2)
they come up with the same score (which makes sense, it just isn't
quite what I want)
Would there be some method or combination of methods in Similarity
that I could easily override to allow me to penalize the second item
because it had "unused terms"?
On 17-May-07, at 6:43 AM, Andreas Guther wrote:
I am actually using the FieldSelector and unless I did something wrong it
did not provide me any load performance improvements, which was surprising
to me and disappointing at the same time. The only difference I could see
was when I returned
: A particular document can have several date windows.
: Given a specific date, only return those documents where that date
: falls within at least one of those windows.
: Also, note that there are multiple windows here for a single
: document, we can't just search between min start and max end.
T
There is a parser for OpenOffice documents in Nutch. It is a plugin called parse-oo.
You can find more information on the Nutch mailing lists.
On 5/17/07, jim shirreffs <[EMAIL PROTECTED]> wrote:
Anyone know how to add OpenOffice document to a Lucene index? Is there a
parser for OpenOffice?
thanks in advance
Hmmm. Now that I re-read your first mail, something else
suggests itself. You stated:
"We have amongst other fields one field (default) storing all searchable
fields".
Do you need to store this field at all? You can search fields that are
indexed but NOT stored. I've used something of the same
Anyone know how to add OpenOffice document to a Lucene index? Is there a
parser for OpenOffice?
thanks in advance
jim s.
I am actually using the FieldSelector and unless I did something wrong it
did not provide me any load performance improvements which was surprising to
me and disappointing at the same time. The only difference I could see was
when I returned for all fields a NO_LOAD which from my understanding is
Some time ago I posted the results in my peculiar app of using
FieldSelector, and it gave dramatic improvements in my case (a
factor of about 10). I suspect much of that was peculiar to my
index design, so your mileage may vary.
See a thread titled...
"Lucene 2.1, using FieldSelector speeds up
There has been significant discussion on this topic (way more than
I can remember clearly) on the mail thread, but as I remember it's
been referred to as "facet" or "faceted". I think you would get a lot
of info searching for these terms at...
http://www.gossamer-threads.com/lists/lucene/java-use
Hi,
I have two different use-cases for my queries. For the first, performance
is not too critical and I want to sort the results by relevance (score).
The second, however, is performance critical, but the score for each
result is not interesting. I guess, if it was possible to disable scoring
for
You can get it from a Hits object (see the id() method), or you can
iterate over the docs from 0 to maxDoc - 1 (skipping deleted docs).
I have some code at http://www.cnlp.org/apachecon2005/ that shows
various usages for Term Vector. The Lucene in Action book has some
good examples as well.
I haven't tried compression either. I know there was some talk a
while ago about deprecating, but that hasn't happened. The current
implementation yields the highest level of compression. You might
find better results by compressing in your application and storing as
a binary field, thus
Hi Lucene users:
I want to get the TermFreqVector, but I must get the docNum first:
titleVector = reader.getTermFreqVector(docNum, "title");
However, I can't get the docNum from a Lucene Document.
How can I get the docNum using a Document object?
Like this: getTermFreqVector(doc, "title");
xiaojun tong
010-64
Hi All,
I was wondering - is it possible to search and group the results by a
given field?
For example, I have an index with several million records. Most of
them are different Features of the same ID.
I'd love to be able to do.. groupby=ID or something like that
in the results, and provide the
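Once each hit's ID field value is in hand, the "groupby=ID" step asked about above is a plain post-search grouping; Lucene 2.x has no built-in group-by. This hypothetical stdlib sketch (names and the {id, feature} pair layout are mine) shows only that step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByField {
    // Each hit is {idFieldValue, featureName}, purely for illustration.
    static Map<String, List<String>> groupBy(List<String[]> hits) {
        // LinkedHashMap keeps groups in first-seen (i.e. hit) order.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] hit : hits) {
            groups.computeIfAbsent(hit[0], k -> new ArrayList<>()).add(hit[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> hits = List.of(
            new String[]{"id1", "featA"},
            new String[]{"id2", "featB"},
            new String[]{"id1", "featC"});
        System.out.println(groupBy(hits).get("id1")); // [featA, featC]
    }
}
```

For millions of records the same idea is usually done against a cached field rather than stored documents, but the grouping logic is unchanged.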
On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter with
> Lucene document reads.
>
> We have amongst other fields one field (default) storing all searchable
> fields. This field can become of considerable size since we are
Hi All,
I've been thinking about this problem for some time now. I'm trying
to figure out a way to store date windows in lucene so that I can
easily filter as follows.
A particular document can have several date windows.
Given a specific date, only return those documents where that date
falls within at least one of those windows.
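The matching rule for the date-window problem can be pinned down in plain Java (a hypothetical sketch, no Lucene; window bounds are arbitrary numeric days here). It also shows why searching between min start and max end is not enough: a date between two windows lies inside [min start, max end] yet matches no window.

```java
public class DateWindows {
    // A document matches when the date falls inside at least one
    // inclusive [start, end] window.
    static boolean matches(long date, long[][] windows) {
        for (long[] w : windows) {
            if (w[0] <= date && date <= w[1]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long[][] windows = { {100, 200}, {500, 600} };
        System.out.println(matches(150, windows)); // true
        // 300 is inside [min start, max end] = [100, 600],
        // but inside neither window:
        System.out.println(matches(300, windows)); // false
    }
}
```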