Hi Ganesh,
What you are talking about is loading parts of the index (as required)
into RAM. That is exactly what any other decently designed application would
do. The RAMDirectory implementation, on the other hand, just copies all of the
index into RAM. Also, tmpfs is nothing but an explicit copy of the index into
RAM at the file-system level.
The FileSystem index reader loads data into RAM as needed; I have tried it with
more than 6 GB of index (sharded into 20 indexes) and the response is pretty fast.
What significant gain would there be in using a RAMDirectory?
How would modifications made in the RAMDirectory be synced back to the FileSystem?
Regards
Ganesh
- Ori
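To illustrate the difference, a minimal sketch (assuming a 2.4-era API and a
hypothetical index path): with an FSDirectory the OS page cache pulls in only
the parts of the index that searches actually touch, while a RAMDirectory
copies the whole index onto the heap up front.

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.*;

    public class RamVsDisk {
      public static void main(String[] args) throws IOException {
        // On-disk index: the OS page cache holds only what searches touch.
        Directory onDisk = FSDirectory.getDirectory("/path/to/index");
        IndexSearcher lazy = new IndexSearcher(onDisk);

        // RAMDirectory: the entire index is copied onto the Java heap up front.
        Directory inRam = new RAMDirectory(onDisk);
        IndexSearcher eager = new IndexSearcher(inRam);
      }
    }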
That's indeed an alternative. Moreover, I have heard (not measured/compared
myself) from people who tried both the MMap and tmpfs approaches that the
former has some overhead.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Anshum
> To: java-user@lucene.apache.org
: This is perfect, exactly what I was looking for. Thanks much Andrzej!
If you code that up and it works out well, contributing your code as a
Jira attachment could help it become a re-usable tool for others in the
future.
(a simple command line that takes the directory of the index, a value
: I'd rather not make SegmentInfos public; it's a large API and we do
: make changes to it as we change the index format. It's also quite
: internal to Lucene.
:
: Making your own MergePolicy/Scheduler is very much an "advanced" use
: case... so I think it's acceptable to have to put it into the o.a.l.index package.
It looks like you are reusing a Field (the f.setValue(...) calls); are
you sure you're not changing a Document/Field while another thread is
adding it to the index?
If you can post the full code, then I can try to run it on my
wikipedia dump locally.
Mike
Jason Rutherglen wrote:
> Mike,
>
> It only happens when at least 1 million documents are indexed in a
> multithreaded fashion. [...]
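The usual fix for the hazard Mike describes is to stop sharing mutable
Document/Field instances across threads. A hedged sketch (all names here are
illustrative, assuming the 2.4-era API), giving each indexing thread its own
reusable Document via a ThreadLocal:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class PerThreadDocs {
      private static final ThreadLocal docs = new ThreadLocal() {
        protected Object initialValue() {
          Document d = new Document();
          d.add(new Field("uid", "", Field.Store.YES, Field.Index.NOT_ANALYZED));
          return d;
        }
      };

      // Each thread mutates only its own Document, so an addDocument()
      // running on another thread can never observe the change.
      static Document forUid(String uid) {
        Document d = (Document) docs.get();
        ((Field) d.getFieldable("uid")).setValue(uid);
        return d;
      }
    }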
Mike,
It only happens when at least 1 million documents are indexed in a
multithreaded fashion. Maybe I should post the code? I will try indexing
without the payload field; I assume it won't fail, because I indexed
wikipedia before with no issues.
Thanks!
Jason
On Tue, Mar 24, 2009 at 12:25 PM
Using StandardAnalyzer. It's probably the payload field?
This is the code that creates the payload field:

    private static class SinglePayloadTokenStream extends TokenStream {
      private Token token = new Token(UID_TERM.text(), 0, 0);
      private byte[] buffer = new byte[4];
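For context, a hedged guess at how the rest of such a class typically looks;
everything past the three quoted lines (the setUID() helper, the encoding, the
field names) is an assumption, not necessarily Jason's actual code:

    private static class SinglePayloadTokenStream extends TokenStream {
      private Token token = new Token(UID_TERM.text(), 0, 0);
      private byte[] buffer = new byte[4];
      private boolean returned = true;

      void setUID(int uid) {
        // Encode the UID as four payload bytes.
        buffer[0] = (byte) uid;
        buffer[1] = (byte) (uid >> 8);
        buffer[2] = (byte) (uid >> 16);
        buffer[3] = (byte) (uid >> 24);
        token.setPayload(new Payload(buffer));
        returned = false;
      }

      public Token next() {
        // Emit the single payload-carrying token exactly once per document.
        if (returned) return null;
        returned = true;
        return token;
      }
    }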
Thank you for your help, Michael. I've solved the problem by recreating the
index.
The OutOfMemoryError killed the thread that was responsible for index
maintenance, so the index recreation failed without an error message. After
recreating the index, the problem is solved.
Sorry for
I was just able to index all of wikipedia, using StandardAnalyzer,
with assertions enabled, without hitting that exception. Which
analyzer are you using (besides your payload field)?
Mike
Michael McCandless wrote:
> H.
>
> Jason is this easily/compactly repeated? EG, try to index the N docs
I'd rather not make SegmentInfos public; it's a large API and we do
make changes to it as we change the index format. It's also quite
internal to Lucene.
Making your own MergePolicy/Scheduler is very much an "advanced" use
case... so I think it's acceptable to have to put it into the o.a.l.index
package.
H.
Jason is this easily/compactly repeated? EG, try to index the N docs
before that one.
If you remove the SinglePayloadTokenStream field, does the exception
still happen?
Mike
Jason Rutherglen wrote:
> While indexing using
> contrib/org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker
I'm overriding MergePolicy, which is public; however, SegmentInfos is
package-private, which means the MergePolicy subclass must be in the
org.apache.lucene.index package. Can we make SegmentInfos public?
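For readers following along, a hedged sketch of the workaround being
discussed; the class name and per-segment logic are hypothetical, and the
method signature assumes the 2.4-era MergePolicy API:

    package org.apache.lucene.index;  // gives access to package-private SegmentInfos

    import java.io.IOException;

    public class InspectingMergePolicy extends LogByteSizeMergePolicy {
      public MergeSpecification findMerges(SegmentInfos infos, IndexWriter writer)
          throws CorruptIndexException, IOException {
        // SegmentInfos is visible from inside this package, so custom
        // selection logic can inspect each SegmentInfo before delegating.
        for (int i = 0; i < infos.size(); i++) {
          SegmentInfo info = infos.info(i);
          // ... custom per-segment logic would go here ...
        }
        return super.findMerges(infos, writer);
      }
    }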
Actually, I was hoping you could try leaving the getHTML calls in, but
increase the heap size of your Tomcat instance.
Ie, to be sure there really is a leak vs you're just not giving the
JRE enough memory.
I do like your hypothesis, but looking at HTMLParser it seems like the
thread should exit after the parse completes.
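For reference, on a standard Tomcat setup the heap ceiling is raised with a
JVM flag, e.g. via CATALINA_OPTS; the values below are only an example:

    export CATALINA_OPTS="-Xms256m -Xmx1024m"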
Highly appreciate your replies, Michael.
No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap
behaves perfectly.
I completely agree with you; the thread count goes haywire the moment I call
HTMLParser.getTitle(). I have seen a thread count of like 600 before I hit
OOME (with the getTitle() call on).
Odd. I don't know of any memory leaks w/ the demo HTMLParser, hmm
though it's doing some fairly scary stuff in its getReader() method.
EG it spawns a new thread every time you run it. And, it's parsing
the entire HTML document even though you only want the title.
You may want to switch to a better HTML parser.
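If only the title is needed, even a lightweight extraction with no parser (and
no extra thread) may do; a minimal sketch, with the caveat that a regex is not
a real HTML parser:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class TitleExtractor {
      private static final Pattern TITLE = Pattern.compile(
          "<title[^>]*>(.*?)</title>",
          Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

      // Returns the contents of the first <title> element, or "" if absent.
      static String getTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
      }
    }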
While indexing using
contrib/org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker, the
assertion error is from TermsHashPerField.comparePostings(RawPostingList p1,
RawPostingList p2). A Payload is added to the document representing a UID.
Only 1-2 out of 1 million documents indexed generate the assertion error.
After some more research I discovered that the following code snippet
seems to be the culprit. I have to call this to get the "title" of the
indexed html page, and it is called 10 times, as I display 10 results on
a page.
Any suggestions on how to achieve this without the OOME issue?
Hello all,
Our application involves a high index write rate - anywhere from a few
dozen to many thousands of docs per sec. The write rate is frequently
higher than the read rate (though not always), and our index must be
as fresh as possible (we'd like search results to be no more than a
couple of seconds stale).
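A commonly suggested pattern for that kind of freshness, sketched under the
assumption of the 2.4-era IndexReader.reopen() API (the surrounding names are
illustrative): reopen() shares unchanged segments with the old reader, so a
periodic refresh is much cheaper than a full open.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    class SearcherRefresher {
      // Call periodically (e.g. every second) and swap the result in.
      static IndexSearcher refresh(IndexSearcher searcher) throws IOException {
        IndexReader old = searcher.getIndexReader();
        IndexReader fresh = old.reopen();
        if (fresh == old) return searcher;  // nothing changed on disk
        old.close();
        return new IndexSearcher(fresh);
      }
    }

In production you would also reference-count readers still serving in-flight
queries before closing the old one.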
Seid Mohammed wrote:
ok, but I need to know how to proceed with it.
I mean how to include to my application
many thanks
Seid M
You may want to look at the following articles:
http://lucene.jugem.jp/?eid=133
http://lucene.jugem.jp/?eid=134
The articles are in Japanese, but you can ignore the text and just look at the code. :)
Pro
I have been able to successfully index and search text from structured
documents like PDF and MS Word. I am having a real hard time trying to
figure out how to group the indexed strings together, e.g. if my document had a
question and answer in a table, the search will produce the text with the
question
There is even an old thread about this on the Mahout-users list:
http://markmail.org/message/ludu5hjfczuvgk3n
On 17 Mar 2009, at 15:17, Grant Ingersoll wrote:
Have a look at the Lucene sister project, Mahout: http://lucene.apache.org/mahout.
In there is the Taste collaborative filtering project
ok, but I need to know how to proceed with it.
I mean how to include to my application
many thanks
Seid M
On 3/24/09, Koji Sekiguchi wrote:
> Seid Mohammed wrote:
>> Hi All
>> I want Lucene to index documents and give some terms a higher
>> boost value.
>> So, if I index the document "
Seid Mohammed wrote:
Hi All
I want Lucene to index documents and give some terms a higher boost value.
So, if I index the document "The quick fox jumps over the lazy dog",
I want the terms fox and dog to have a greater boost value.
How can I do that?
Thanks a lot
seid M
How about
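Koji's reply is cut off here, but one common approach for per-term boosts (not
necessarily what he went on to suggest) is payloads. A hedged sketch against
the 2.4-era analysis API; the filter name and the boost byte are illustrative:

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    class BoostedTermsFilter extends TokenFilter {
      private final Set boosted;  // e.g. the set {"fox", "dog"}

      BoostedTermsFilter(TokenStream in, Set boosted) {
        super(in);
        this.boosted = boosted;
      }

      public Token next(Token reusableToken) throws IOException {
        Token t = input.next(reusableToken);
        if (t != null && boosted.contains(t.term())) {
          // Tag the term with a payload byte that a Similarity can read back.
          t.setPayload(new Payload(new byte[] { 10 }));
        }
        return t;
      }
    }

At query time a BoostingTermQuery, together with a Similarity that overrides
scorePayload(), can turn that byte into a score boost.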
Hi All
I want Lucene to index documents and give some terms a higher boost value.
So, if I index the document "The quick fox jumps over the lazy dog",
I want the terms fox and dog to have a greater boost value.
How can I do that?
Thanks a lot
seid M
--
"RABI ZIDNI ILMA"
---
When I run CheckIndex on your index, I see a new exception:

    org.apache.lucene.index.CorruptIndexException: Incompatible format version: 119865344 expected 1 or lower
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:116)
        at org.apache.lucene.index.SegmentReader.initialize
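For anyone who wants to reproduce this, CheckIndex ships with Lucene and can
be run directly; the jar name and index path here are illustrative:

    java -cp lucene-core-2.4.0.jar org.apache.lucene.index.CheckIndex /path/to/index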
Here's my first approach, but I note that, typically, I have fields
(which are not stored) which may be the matching field but still not
the one I want to return.
Typically, I have a field "names in all languages, run through the
StandardAnalyzer" which is not the one I want to "see as matched".
Instead of ignoring the exceptions in your finally clause, can you log
them? It could be something interesting is happening in there...
I'll have a look at the index.
Mike
"René Zöpnek" wrote:
> Thanks for your answer, Mike.
>
> Unfortunately I have no direct access to the server with the corrupt index.
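A minimal sketch of the kind of logging being asked for; the method and
variable names are illustrative, not from René's code:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;

    class Cleanup {
      static void closeAndLog(IndexWriter writer) {
        try {
          writer.close();
        } catch (IOException e) {
          // Log instead of swallowing: a failure here may explain the corruption.
          e.printStackTrace();
        }
      }
    }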
Hi Paul,
Going by what you've conveyed here, I'd assume that you have more than a
little data. You could go ahead with Ian's way, which is the suggested one (as
far as the Lucene implementation is concerned), but it'd not be possible if
your index is greater than 2 gigs and you are not running a 64-bit JVM.
Ian Lea wrote:
Hi
You can load an existing index into a RAMDirectory using one of the
constructors that takes an existing index. I believe that a RAM index
will be the same size as a file based index.
Of course - I was looking at IndexSearcher, but the constructor is for
RAMDirectory.
MMapDirectory
Hi
You can load an existing index into a RAMDirectory using one of the
constructors that takes an existing index. I believe that a RAM index
will be the same size as a file based index.
MMapDirectory is another possibility.
--
Ian.
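On the MMapDirectory option: in this era it was typically selected through
FSDirectory's factory rather than constructed directly. A hedged sketch;
treat the system property name as an assumption to verify against your
Lucene version, and the index path as hypothetical:

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class UseMMap {
      public static void main(String[] args) throws java.io.IOException {
        // Ask FSDirectory.getDirectory() to hand back an MMapDirectory
        // instead of the default implementation.
        System.setProperty("org.apache.lucene.FSDirectory.class",
                           "org.apache.lucene.store.MMapDirectory");
        Directory dir = FSDirectory.getDirectory("/path/to/index");
      }
    }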
On Tue, Mar 24, 2009 at 8:42 AM, Paul Taylor wrote:
> Hi
Hi
I've built some file-based indexes based on data in a database, and it
took quite some time.
I am interested in trying RAM-based indexes instead of file-based
indexes to compare search performance, but it's going to take some time to
rebuild the index from the original database, isn't it?
Do you have any info that helps you narrow down how many to choose,
like some type of ranking of the synonyms? I guess I would start
smaller, say maybe 3, and then evaluate your results with different
numbers.
On Mar 22, 2009, at 2:40 PM, liat oren wrote:
Ok, thanks. I will look how to u
Thanks for your answer, Mike.
Unfortunately I have no direct access to the server with the corrupt index. So
changing the creation process of the index is not possible.
I've uploaded the index to http://drop.io/hlu53sl (9 MB).
Here is the code for creating the index:
public static void crea