No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap behaves perfectly.
I completely agree with you: the thread count goes haywire the moment I call HTMLParser.getTitle(). I have seen a thread count of around 600 before I hit OOME (with the getTitle() call on) and
I'm just wondering if anyone can share their learnings on optimizing storage configurations for relatively large indexes (millions of documents, 10+ GB in size). Is there a 'suggested best' stripe size for RAID-10 configurations?
I did some Googling, and was surprised I couldn't f
I don't believe our large users have enough memory for Lucene indexes to fit in RAM. (Especially given we use quite a bit of RAM for other stuff.) I think we also close readers pretty frequently (whenever any user updates a JIRA issue, which I am assuming is happening nearly constantly
On 23/10/2008, at 4:20 PM, Ganesh wrote:
My index DB has 10 million records and it will grow to 30 million. Currently I am using millisecond timestamps and the RAM consumption is high. I will change the resolution to minutes. I am using 2 searcher objects refreshing each other every min
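For reference, truncating timestamps to minute resolution before indexing (the change described above) collapses many distinct term values into one, which is what shrinks the RAM used by the term dictionary and sort caches. A minimal plain-Java sketch of the truncation step (names are made up; Lucene itself also ships a DateTools helper with a MINUTE resolution that serves the same purpose):

```java
import java.util.concurrent.TimeUnit;

public class MinuteResolution {
    // Drop the sub-minute part of an epoch-millisecond timestamp, so all
    // events within the same minute index to the same term value.
    public static long toMinute(long epochMillis) {
        long minuteMillis = TimeUnit.MINUTES.toMillis(1);
        return (epochMillis / minuteMillis) * minuteMillis;
    }
}
```

Every timestamp in a given minute maps to one indexed value instead of up to 60,000 distinct millisecond values.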
On 15/10/2008, at 7:37 AM, Chris Gilliam wrote:
Hello Everyone,
New to Lucene..
We currently have roughly 100 GB of log files. We need to build a search application that can return rows of data from the files and combine the results.
Does Lucene index the content in the files?
Will i
en once you get past the synchronization
bottleneck in the CollationKey stuff).
cheers,
Paul Smith
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
duce the Terms in memory, but I have not seen how to set
this value for Lucene.
Any help would be greatly appreciated.
Rich
Paul Smith
Core Engineering Manager
Aconex
The easy way to save time and money on your project
696 Bourke Street, Melbourne,
VIC 3000, Australia
Tel: +61 3 9240 0200
Are you using locale-sensitive sorting at all?
https://issues.apache.org/jira/browse/LUCENE-806
Just wondering if you're seeing the same problem we are having.
cheers,
Paul Smith
On 19/02/2007, at 8:52 AM, dmitri wrote:
We have a search (no update) web app on 2 dual-core CPU machin
is synchronized. I wonder if a ThreadLocal-based collator would be better here? There doesn't appear to be a reason for other threads searching the same index to wait on this sort; it would be just as easy for each to use its own. (Is RuleBasedCollator a "heavy" object memory-wise? Wou
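The ThreadLocal idea above can be sketched like this (a hypothetical wrapper, not Lucene's actual sort code): each searching thread gets its own Collator, so comparisons never contend on RuleBasedCollator's synchronized compare().

```java
import java.text.Collator;
import java.util.Locale;

public class PerThreadCollator {
    // One Collator per thread: Collator instances are not thread-safe, and
    // sharing a single RuleBasedCollator serializes every sorting thread
    // behind its synchronized compare() method.
    private static final ThreadLocal<Collator> COLLATOR =
        ThreadLocal.withInitial(() -> Collator.getInstance(Locale.ENGLISH));

    public static int compare(String a, String b) {
        return COLLATOR.get().compare(a, b);
    }
}
```

The memory cost is one collator per thread rather than one per index, which is usually a fine trade against lock contention.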
y 3.75 US cents).
Paul Smith
That is neat... nice work.
On 02/03/2006, at 10:23 AM, Larry Ogrodnek wrote:
Hey, I put together a little ajax / lucene javadoc lookup site that I just wanted to share. I've found it pretty useful to be able to just type a few letters instead of navigating through the standard javadoc fr
or deletion. Waiting the amount of time for the IndexSearcher to
close sees the file descriptor released.
Sorry for the intrusion.
cheers,
Paul Smith
On 14/02/2006, at 7:44 AM, Doug Cutting wrote:
Paul Smith wrote:
We're using Lucene 1.4.3, and after hunting around in the source
code just to see what I might be missing, I came across this, and
I'd just like some comments.
Please try using a 1.9 build to see if this is
assumably, close the file.
The guard here is that the finalizer method in FSInputStream does
call close() so that would well explain the releasing of file handles
at garbage collection intervals.
Why would CompoundFileReader not need to call .close()?
Am I going mad here and just seeing ghosts? Comments appreciated.
Paul Smith
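The behaviour being described above (file handles only released at GC intervals, because a finalizer is the thing that ends up calling close()) can be illustrated with a toy stand-in; this is not Lucene's FSInputStream, just the shape of the problem:

```java
public class TrackedStream implements AutoCloseable {
    // Toy stand-in for a stream holding an OS file descriptor:
    // the "descriptor" here is just a flag.
    private boolean closed = false;

    @Override
    public void close() {
        closed = true; // real code: release the OS file descriptor here
    }

    public boolean isClosed() {
        return closed;
    }
}
```

With an explicit close() (or try-with-resources) the descriptor is released deterministically at block exit; if nothing calls close() and only a finalizer does it as a safety net, release waits for whenever the GC happens to collect the object, which matches the "waiting sees the file descriptor released" observation.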
his all the time in a Tomcat app server box, where each HTTP Connector is a thread and appears as its own process.
cheers,
Paul Smith
On 17/01/2006, at 7:11 AM, Aigner, Thomas wrote:
Hi all,
Is anyone experiencing possible memory problems on LINUX with
Lucene search? Here is o
one thing you may not have thought about yet that may affect your decision: sorting in Lucene requires the field be indexed but untokenized.
so if you want to support sorting on the conceptual "title", you'll still need a version of your title field that's untokenized, which can then be u
query
trick works if it searches on
title:"0start0 auto*"
but does not find any matches for
title:"0start0 aut*"
I'm a bit stuck.
Paul
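For what it's worth, the anchored-prefix trick being debugged above comes down to the following string-level idea (a plain-Java sketch with hypothetical names; inside Lucene the wildcard half needs a prefix-capable query, since wildcards are not expanded inside phrase queries by the standard QueryParser):

```java
import java.util.Locale;

public class TitlePrefix {
    static final String START = "0start0";

    // The value actually indexed for the title field: a magic token marks
    // the start, so a match can be anchored at the beginning of the title.
    static String indexedValue(String title) {
        return START + " " + title.toLowerCase(Locale.ROOT).trim();
    }

    // Plain-string stand-in for the anchored prefix query.
    static boolean titleStartsWith(String indexedValue, String prefix) {
        return indexedValue.startsWith(
            START + " " + prefix.toLowerCase(Locale.ROOT).trim());
    }
}
```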
On 06/01/2006, at 10:43 AM, Paul Smith wrote:
2) index a magic token at the start of the title and include that in a
phrase query:
"_START_ the quick"
Ok, I've gone and chosen "0start0" as my start token, because our analyzer is stripping _.
Now, second dumb question of the day: given the search for starts-with "The qui*", that has t
1) also index the field untokenized and use a straight prefix query
See my reply to Chris, not sure I can afford the index size increment.
2) index a magic token at the start of the title and include that in a
phrase query:
"_START_ the quick"
h, that's clever.
3) use a SpanFirst quer
On 06/01/2006, at 9:33 AM, Chris Hostetter wrote:
: Think SQL of " where title like 'The quick%' ".
I solved this problem by having a variation of my field that was not
tokenized, and did PrefixQueries on that field (so in your case, leave
your title field alone for generic matches, and
How do I do that with Lucene? I'm sure this is a dumb question, and I know that Lucene's searching is way more useful than that, but you know these pesky compatibility requirements. It's screwing with my unit tests
he user has probably navigated away and given up on the long
running search anyway).
Paul Smith
On 18/11/2005, at 6:57 PM, Matt Magoffin wrote:
I'm updating nearly continuously (probably averaging about every 10 seconds). I don't explicitly close the IndexSearcher objects I create, as I sha
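A common pattern for this "update frequently without leaking searchers" situation is to publish one shared searcher and swap it atomically on refresh, closing the replaced one. A simplified sketch (the Searcher type here is a hypothetical stand-in for IndexSearcher):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SearcherHolder {
    // Stand-in for IndexSearcher: something that must be closed exactly once.
    public static class Searcher implements AutoCloseable {
        volatile boolean closed = false;
        @Override public void close() { closed = true; }
    }

    private final AtomicReference<Searcher> current =
        new AtomicReference<>(new Searcher());

    public Searcher get() { return current.get(); }

    // On index update: publish the new searcher, then close the old one.
    public void refresh(Searcher fresh) {
        Searcher old = current.getAndSet(fresh);
        old.close(); // real code would first wait out in-flight queries
                     // (e.g. via reference counting)
    }
}
```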
This would be a good candidate for an IllegalStateException to be thrown if the user calls this method when it's not valid. Save the user some hassle? (One can JavaDoc until one is blue in the face, but throwing a good RuntimeException with a message trains users much quicker... :) )
P
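The suggestion above, as a sketch (hypothetical class, not an actual Lucene API): fail fast with a message instead of letting the misuse surface as a confusing failure deeper down.

```java
public class Cursor {
    private boolean open = true;

    public void close() { open = false; }

    // Guard clause: calling next() after close() is a programming error,
    // so say so loudly rather than fail somewhere less obvious.
    public int next() {
        if (!open) {
            throw new IllegalStateException("next() called after close()");
        }
        return 42; // hypothetical payload
    }
}
```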
On 15/07/2005, at 3:57 PM, Otis Gospodnetic wrote:
The problem that I saw (from your email only) with the "ship the full
little index to the Queen" approach is that, from what I understand,
you eventually do addIndexes(Directory[]) in there, and as this
optimizes things in the end, this means y
answering my own question: nutch.org -> lucene.apache.org/nutch/
Excellent!
Paul
On 15/07/2005, at 11:45 AM, Paul Smith wrote:
Cl, I should go have a look at that.. That begs another
question though, where does Nutch stand in terms of the ASF? Did I
read (or dream) that Nutch may
, Paul Smith wrote:
My punt was that having workers create sub-indexes (creating the documents and making a partial index) and shipping the partial index back to the queen to merge might be more efficient. It's probably not; I was just using the day as a chance to see if it looked prom
m in the same index? Maybe you need those individual,
smaller indices to be separate
How do you deal with the possibility of the same Document being
present
in multiple indices?
Otis
--- Paul Smith <[EMAIL PROTECTED]> wrote:
I had a crack at whipping up something along this li
ghput problem on that too).
Would love to see something like this work really well, and perhaps
generalize it a bit more. I do like the simplicity of the SEDA
principles.
cheers,
Paul Smith
On 14/07/2005, at 11:50 PM, Peter Gelderbloem wrote:
I am currently looking into building a
On 13/07/2005, at 1:34 AM, Chris Hostetter wrote:
: Since this isn't in production yet, I'd rather be proven wrong now
: rather than later! :)
it sounds like what you're doing makes a lot of sense given your
situation, and the nature of your data.
the one thing you might not have considered
Many thanks for confirming the principles should work fine. It is a
load off my mind! :)
On index update, a small Event is triggered into a Buffer, that is
periodically (every 30 seconds) processed to coalesce them, then
ensure that any open IndexSearcher in the cache is closed.
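The coalescing step described above can be sketched as follows (hypothetical event representation; the real trigger would be the 30-second timer, e.g. a ScheduledExecutorService draining the buffer):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UpdateCoalescer {
    // Many update events touching the same index within one window collapse
    // to a single "close cached searchers for this index" action.
    public static Set<String> coalesce(List<String> indexNamesTouched) {
        return new LinkedHashSet<>(indexNamesTouched);
    }
}
```

Coalescing is what keeps a burst of updates from thrashing the searcher cache: N events in the window cost one close/reopen, not N.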
On 12/07
On 11/07/2005, at 10:43 AM, Chris Hostetter wrote:
: > Generally speaking, you only ever need one active Searcher, which
: > all of
: > your threads should be able to use. (Of course, Nathan says that
: > in his
: > code base, doing this causes his JVM to freeze up, but I've
never seen
: >
On 11/07/2005, at 9:15 AM, Chris Hostetter wrote:
: Nathan's point about pooling Searchers is something that we also
: addressed by a LRU cache mechanism. In testing we also found that
Generally speaking, you only ever need one active Searcher, which
all of
your threads should be able to u
omatically closed?
Appreciate any thoughts on this. I'd rather know now while I have
the opportunity to change the design than later when in production.. :)
cheers,
Paul Smith
On 09/07/2005, at 5:39 AM, Otis Gospodnetic wrote:
Nathan,
3) is the recommended usage.
Your index is on an
On 27/06/2005, at 7:14 PM, Nader Henein wrote:
I implemented a JMS based solution about a year ago because I
thought it would solve my atomicity problem and give me a
centralized way of indexing, you'll have to use the pluggable
persistence (if you use ActiveMQ) to be able to recover from
If you use ActiveMQ for JMS, you can take advantage of its Composite Destination feature and have a virtual Queue/Topic that is actually several Queues/Topics. This is what we use to keep a mirror index server completely in sync. The application sends an update message to a queue
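For reference, a composite destination is configured roughly like this in the ActiveMQ broker XML (the destination names here are made up; check the virtual destinations section of the ActiveMQ docs for your version):

```xml
<destinationInterceptors>
  <virtualDestinationInterceptor>
    <virtualDestinations>
      <!-- One send to index.updates fans out to both physical queues,
           so the mirror indexer sees every update the primary does. -->
      <compositeQueue name="index.updates">
        <forwardTo>
          <queue physicalName="index.primary"/>
          <queue physicalName="index.mirror"/>
        </forwardTo>
      </compositeQueue>
    </virtualDestinations>
  </virtualDestinationInterceptor>
</destinationInterceptors>
```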
hout the main application knowing
anything about it.
Paul Smith
On 26/06/2005, at 2:35 AM, Stephane Bailliez wrote:
I have been browsing the archives concerning this particular topic.
I'm in the same boat and the customer has clustering requirements.
To give some background:
I ha
Indexing every multi-word synonym as a single token would introduce
spaces into the tokens. In that case searching for (java) would not
match "i love jsp and tomcat". I think that searching for (java*) would
match.
Rewriting the query is also problematic. If you search for (java
server), you don't
Thanks for your help guys!
If you put the term query at position 2 then you need slop to find "Use
PowerQuery for advanced searches", which is the exact text in the
document. I think I'd rather have that phrase query work without any
slop, and require some slop for "use power query for advanced
I am writing a document management system for my company, and many of
our feature names are in Hungarian notation (PowerQuery,
TransactionManager, etc.). This can make it hard to find some things
with a default analyzer.
I'd like to be able to index text like "Use PowerQuery for advanced
searches"
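One common approach to the Hungarian-notation problem above is an analysis step that splits tokens on lower-to-upper case transitions and indexes the sub-words alongside the original token (later Lucene/Solr versions ship a WordDelimiterFilter that does essentially this). A minimal sketch of the splitting step, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CamelCaseSplitter {
    // Zero-width boundary between a lowercase/digit and an uppercase letter.
    private static final Pattern BOUNDARY =
        Pattern.compile("(?<=[a-z0-9])(?=[A-Z])");

    // "PowerQuery" -> ["PowerQuery", "Power", "Query"]: keep the original
    // token so exact queries still match, and add the sub-words so
    // "query" finds "PowerQuery" too.
    public static List<String> tokens(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        String[] parts = BOUNDARY.split(term);
        if (parts.length > 1) {
            for (String p : parts) out.add(p);
        }
        return out;
    }
}
```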
your application too, which is very useful for a single instance, and
can be easily broken out to be used in a clustered environment.
cheers,
Paul Smith
.
my question is: are there any performance concerns here if ("...In(g,h,i,j,) ") starts getting longer and longer? Can Lucene handle this in an optimal manner, without a serious scalability issue (memory/CPU/IO etc.)? Or would it be better that a different design is used for th
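In Lucene a growing IN(...) list typically rewrites to a BooleanQuery of OR'd term clauses, and BooleanQuery enforces a maximum clause count (1024 by default, adjustable via setMaxClauseCount), so very long lists eventually throw TooManyClauses. One mitigation is to batch the list and merge results; a plain-Java sketch of the batching step (names hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class ClauseBatcher {
    // Split a long list of IDs into batches no larger than the clause limit,
    // so each batch can be issued as one OR query and the hits merged.
    public static <T> List<List<T>> batches(List<T> ids, int maxClauses) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += maxClauses) {
            out.add(new ArrayList<>(
                ids.subList(i, Math.min(i + maxClauses, ids.size()))));
        }
        return out;
    }
}
```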