I experimented with it, but somehow (I am not sure why) I got poorer
indexing performance with higher RAM. That was an initial experiment and I did
not dig into it. But, for the time being, I have acceptable indexing speed, so
I am only focusing on reducing search time.
Thanks and Regards,
Shelly
So, you didn't really use setRAMBufferSizeMB then?
Any reasons for that?
--
Anshum Gupta
http://ai-cafe.blogspot.com
On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh wrote:
> My final settings are: ...
My final settings are:
1. 1.5 gig RAM to the jvm out of 2GB available for my desktop
2. 100GB disk space.
3. Index creation and searching tuning factors:
a. mergeFactor = 10
b. maxFieldLength = 10
c. maxMergeDocs = 500
d. full optimize at end of indexing
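For anyone mapping those settings onto code, here is a minimal sketch against the 2.x IndexWriter API; the directory path and analyzer are placeholder assumptions, not Shelly's actual values:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class BuildIndex {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/data/index"),  // on the 100GB volume
                new StandardAnalyzer(),
                true,                                     // create a fresh index
                IndexWriter.MaxFieldLength.LIMITED);
            writer.setMergeFactor(10);       // (a)
            writer.setMaxFieldLength(10);    // (b)
            writer.setMaxMergeDocs(500);     // (c)
            // ... addDocument() calls go here ...
            writer.optimize();               // (d) full optimize at the end
            writer.close();
        }
    }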
Hello all,
I am getting the following exception for one of my customers. I think the
index is corrupted, but I want to know the exact cause.
Exception: read past EOF
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
I am using Compass which uses Lucene 2.4.1.
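If you want to confirm whether the index itself is damaged, Lucene ships a CheckIndex tool. Below is a hedged sketch using the 2.9-era programmatic API (on 2.4.1 the same class is normally run from the command line via its main()); the index path is a placeholder:

    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class DiagnoseIndex {
        public static void main(String[] args) throws Exception {
            CheckIndex checker =
                new CheckIndex(FSDirectory.getDirectory("/path/to/index"));
            CheckIndex.Status status = checker.checkIndex(); // walks every segment
            System.out.println(status.clean ? "index is clean"
                                            : "corrupt segments found");
            // checker.fixIndex(status) would drop unreadable segments --
            // the documents in them are lost, so back up the index first.
        }
    }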
We are trying to implement "find as you type" searching. So for example, if I
have documents containing the following:
Newfoundland
New Mexico
New York
New Yorkshire
If a user types "New", all four documents are returned. When the user adds a
space, only the documents containing "New" as a whole word (New Mexico, New
York, New Yorkshire) should be returned.
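One possible approach (a sketch, not the only way): index each name a second time as a single untokenized lowercased term, then run the user's raw input as a PrefixQuery against it, so the trailing space in "New " rules out "Newfoundland". The field and class names here are made up for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TopDocs;

    public class FindAsYouType {
        // At index time: also store the whole name as one lowercased term.
        static void addName(Document doc, String name) {
            doc.add(new Field("name", name, Field.Store.YES,
                              Field.Index.TOKENIZED));
            doc.add(new Field("name_exact", name.toLowerCase(),
                              Field.Store.NO, Field.Index.UN_TOKENIZED));
        }

        // At search time: "new" matches all four names, "new " only the last three.
        static TopDocs suggest(IndexSearcher searcher, String typed)
                throws Exception {
            PrefixQuery q =
                new PrefixQuery(new Term("name_exact", typed.toLowerCase()));
            return searcher.search(q, 10);
        }
    }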
Would it matter if an IndexReader was opened while an index merge is in
progress?
Thanks
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, August 10, 2010 12:03 PM
To: java-user@lucene.apache.org
Subject: Re: Index merge question
When you open an IndexReader, Lucene effectively takes
a "snapshot" of the index and searches it until you reopen
your reader. So the timing of when the merged index gets
used is up to you; you should be fine.
Best
Erick
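In code, that snapshot-then-reopen cycle looks roughly like this (2.4-era API, where IndexReader.reopen() was introduced):

    import org.apache.lucene.index.IndexReader;

    public class ReaderRefresh {
        // The reader keeps searching its snapshot; merged/committed changes
        // are picked up only when you decide to reopen.
        static IndexReader refresh(IndexReader reader) throws Exception {
            IndexReader newReader = reader.reopen(); // cheap if nothing changed
            if (newReader != reader) {
                reader.close();      // release the old snapshot
                return newReader;
            }
            return reader;
        }
    }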
On Tue, Aug 10, 2010 at 11:28 AM, wrote:
Hello,
Is there any point during a merge operation where the index cannot be
searched or is unstable? We want to create a bunch of smaller indexes in
parallel and then merge them into a single index that may have searches
running against it.
Thanks,
Ian Koelliker
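For reference, a minimal sketch of that plan using the 2.4-era addIndexesNoOptimize(); the directory variables are placeholders. The merge runs inside the writer on the big index, so readers already open keep searching their snapshot until reopened:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class MergeShards {
        static void merge(Directory mainDir, Directory[] smallIndexes)
                throws Exception {
            IndexWriter writer = new IndexWriter(mainDir, new StandardAnalyzer(),
                    false, IndexWriter.MaxFieldLength.LIMITED);
            writer.addIndexesNoOptimize(smallIndexes); // merge inside the writer
            writer.close();  // commit: readers opened after this see the merge
        }
    }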
Shelly,
Do you mind sharing with the list the final settings you used for your best
results?
Cheers,
Pablo
On Tue, Aug 10, 2010 at 3:49 PM, anshum.gu...@naukri.com
wrote:
Hey Shelly,
If you want to get more info on Lucene, I'd recommend you get a copy of Lucene
in Action, 2nd Ed. It'll help you get the hang of a lot of things! :)
--
Anshum
http://blog.anshumgupta.net
Sent from BlackBerry®
-Original Message-
From: Shelly_Singh
Date: Tue, 10 Aug 2010 19:11:1
Hi folks,
Thanks for the excellent support and guidance on my very first day on this
mailing list...
At the end of the day, I have very optimistic results: 100 mln search in less
than 1 ms, and the index creation time is not huge either (close to 15 minutes).
I am now hitting the 1bln mark with roughly the ...
Hi All,
Some very promising findings... for 100 mln (a factor of 10 less than my
goal), I could bring the search speed down to 'single-digit' milliseconds. The
major change is that I am now optimizing the index, which I was shying away
from doing earlier.
For fun, I am planning to take a reading of 1 bln.
That won't work... if you have a document like "A Basic Crazy
Document E-something F-something G-something (you get the point)", it
will go to all shards, so the whole point of sharding will be
compromised... you'll end up with a 26-billion-document index ;)
Looks like the only way is to search all shards.
D
Hmm.. I get the point. But, in my application, the document is basically a
descriptive name of a particular thing. The user will search by name (or part
of the name) and I need to pull out all the info pointed to by that name. This
info is externalized in a db.
One option I can think of is: I can shard ...
You might want to take a look at RemoteSearchable (
http://lucene.apache.org/java/2_9_2/api/contrib-remote/org/apache/lucene/search/RemoteSearchable.html)
-- it'll be helpful if you place shards on different servers.
On Tue, Aug 10, 2010 at 6:08 PM, Shelly_Singh wrote:
> - shard into 10 indices.
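A rough sketch of that RMI plumbing, based on the contrib-remote API linked above; the host name, index path, and bind names are placeholders:

    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RemoteSearchable;
    import org.apache.lucene.search.Searchable;

    public class ShardServer {
        public static void main(String[] args) throws Exception {
            LocateRegistry.createRegistry(1099);       // default RMI port
            Searchable local = new IndexSearcher("/data/shard0"); // one shard
            Naming.rebind("//localhost/shard0", new RemoteSearchable(local));
            // A client then does:
            //   Searchable s = (Searchable) Naming.lookup("//serverA/shard0");
            // and can hand several such stubs to a (Parallel)MultiSearcher.
        }
    }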
I'd second that.
The shard key doesn't have to be the date. Maybe every query has some
specific field, like UserId or something, so you can redirect to a
specific shard instead of hitting all 10 indices.
You have to have some kind of narrowing: searching 1bln documents with
queries that may hit all of them ...
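For example, routing each document by a stable hash of such a key at index time means a query that knows its UserId touches exactly one shard. A sketch; the "userId" field and the class name are hypothetical:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ShardRouter {
        private final IndexWriter[] shards;  // one writer per index directory

        ShardRouter(IndexWriter[] shards) { this.shards = shards; }

        // Same key -> same shard, at index time and at query time.
        int shardFor(String userId) {
            return (userId.hashCode() & 0x7fffffff) % shards.length;
        }

        // Assumes "userId" is a stored field on the document.
        void add(Document doc) throws IOException {
            shards[shardFor(doc.get("userId"))].addDocument(doc);
        }
    }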
- shard into 10 indices. If you need the concept of a date range search, I
would assign the documents to the shard by date, otherwise random assignment is
fine.
Shelly - Actually my documents are originally database records with each being
equally important.
- have a pool of IndexSearchers for each index ...
I do not see a way to optimally decide how to shard the data. It's very
difficult for my purpose, so the safe bet is to assume that all indices
will need to be searched.
Okay, I can try ParallelMultiSearcher in addition to MultiSearcher.
-Original Message-
From: Anshum [mailto:ansh.
Shelly:
You wouldn't necessarily have to use a multisearcher. A suggested alternative
is:
- shard into 10 indices. If you need the concept of a date range search, I
would assign the documents to the shard by date, otherwise random assignment is
fine.
- have a pool of IndexSearchers for each index ...
Searching all indices shouldn't be that bad an idea compared to searching
a single huge index, especially considering you have a constraint on the
usable memory.
You could use a ParallelMultiSearcher which spawns threads to query across
indexes and merges the results.
What I asked was, is there a ...
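A minimal sketch of that fan-out (2.x API; the shard paths and query field are placeholder assumptions):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class FanOutSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = new Searchable[10];
            for (int i = 0; i < shards.length; i++) {
                shards[i] = new IndexSearcher(
                    FSDirectory.getDirectory("/data/shard" + i));
            }
            // One thread per shard; hits are merged into a single ranking.
            Searcher searcher = new ParallelMultiSearcher(shards);
            TopDocs hits =
                searcher.search(new TermQuery(new Term("name", "york")), 10);
            System.out.println(hits.totalHits + " hits across all shards");
        }
    }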
No sort. I will need relevance based on TF. If I shard, I will have to search
in all indices.
-Original Message-
From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 1:54 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
Hi Shelly,
You need to reduce your maxMergeDocs. Set ramBufferSizeMB to 100, which will
help you use less RAM while indexing.
>>>search time is 15 secs..
How are you calculating this time? Are you just taking the time difference
before and after the search method, or does this involve the time to parse
the document ...?
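For what it's worth, one way to isolate the raw search call is to keep query parsing and result rendering outside the timed section; a sketch, with the searcher and parser assumed to exist already:

    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class TimedSearch {
        static TopDocs timedSearch(IndexSearcher searcher, QueryParser parser,
                                   String input) throws Exception {
            Query query = parser.parse(input); // parse before the clock starts
            long t0 = System.nanoTime();
            TopDocs hits = searcher.search(query, 10);
            long ms = (System.nanoTime() - t0) / 1000000L;
            System.out.println("raw search time: " + ms + " ms");
            return hits;  // render results after the clock stops
        }
    }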
Correction: mergeFactor determines how many segments are merged at once.
It's IndexWriter's ramBufferSizeMB and/or maxBufferedDocs that
determine how many docs are buffered in RAM before a new segment is
flushed.
A higher mergeFactor will require more RAM during merging and will cause
longer-running merges ...
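In the 2.x API, the knobs this correction distinguishes look like this (a sketch on an existing writer):

    import org.apache.lucene.index.IndexWriter;

    public class Tuning {
        static void tune(IndexWriter writer) {
            writer.setRAMBufferSizeMB(100);      // flush a new segment every ~100MB
            // writer.setMaxBufferedDocs(10000); // ...or flush by doc count instead
            writer.setMergeFactor(10);           // how many segments merge at once
            writer.setUseCompoundFile(false);    // more file handles, faster indexing
        }
    }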
Would like to know, are you using a particular type of sort? Do you need to
sort on relevance? Can you shard and restrict your search to a limited set of
indexes functionally?
--
Anshum
http://blog.anshumgupta.net
Sent from BlackBerry®
-Original Message-
From: Shelly_Singh
Date: Tue,
Hi Danil,
I get your point. In fact, the latest readings I have for 1bln docs are also
asserting the same thing.
Index creation time is 2 hours, which is fine by me... but search time is 15
secs, which is too high for any application.
I am planning to do a sharding of indices and then use a MultiSearcher.
Hi Anshum,
I am already running with the 'setCompoundFile' option off.
And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a
couple of days ago, but got an OOM, so I discarded it. Later I figured that
the OOM was because maxMergeDocs was unlimited and I was using MMap. You are
right, ...
The problem actually won't be the indexing part.
Searching such a large dataset will require a LOT of memory.
If you need sorting or faceting on one of the fields, the jvm will explode ;)
Also, GC times on a large jvm heap are pretty disturbing (if you care
about your search performance).
So I'd advise ...
Hi Shelly,
That seems like a reasonable data set size. I'd suggest you increase your
mergeFactor, as a mergeFactor of 10 means you are only buffering 10 docs in
memory before writing them to a file (and incurring I/O). You could actually
flush by RAM usage instead of a doc count. Turn off using the Compound File
format.